VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model (2403.05346v3)
Abstract: In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially effective, struggle with multi-scenario incremental learning because they tend to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses a Vision-Language Model (VLM) to verify the correctness of pseudo ground truths (GTs) without requiring additional model training. VLM-PL first derives pseudo GTs from a pre-trained detector. It then generates a custom query for each pseudo GT using carefully designed prompt templates that combine image and text features, allowing the VLM to judge correctness through its responses. Finally, VLM-PL combines the refined pseudo GTs with real GTs for subsequent training, effectively integrating new and old knowledge. Extensive experiments on the Pascal VOC and MS COCO datasets not only highlight VLM-PL's exceptional performance in multi-scenario settings but also demonstrate its effectiveness in dual-scenario settings, achieving state-of-the-art results in both.
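To make the pipeline the abstract describes concrete, here is a minimal sketch of the verification loop: pseudo GTs come from an old-task detector, each is turned into a templated query, and a VLM's yes/no answer decides whether the pseudo GT is kept and merged with the real GTs of the new task. The `detector.predict` and `vlm.answer` interfaces, the prompt wording, and the confidence threshold are all illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of VLM-PL-style pseudo-GT verification.
# `detector` and `vlm` are hypothetical stand-ins with assumed interfaces.

from dataclasses import dataclass
from typing import List

@dataclass
class PseudoGT:
    box: tuple    # (x1, y1, x2, y2) in image coordinates
    label: str    # class name predicted by the old-task detector
    score: float  # detector confidence

# Assumed prompt template combining the image region with a text query.
PROMPT_TEMPLATE = "Does the highlighted region contain a {label}? Answer 'yes' or 'no'."

def verify_pseudo_gts(image, pseudo_gts: List[PseudoGT], vlm) -> List[PseudoGT]:
    """Keep only the pseudo ground truths the VLM confirms.

    `vlm.answer(image, box, prompt)` is an assumed interface: it highlights
    `box` in `image`, feeds the region plus the text prompt to the VLM,
    and returns the model's textual response.
    """
    kept = []
    for gt in pseudo_gts:
        prompt = PROMPT_TEMPLATE.format(label=gt.label)
        response = vlm.answer(image, gt.box, prompt)
        if response.strip().lower().startswith("yes"):
            kept.append(gt)  # verified: reuse as ground truth for old classes
    return kept

def build_training_targets(image, detector, real_gts, vlm):
    """Merge VLM-verified pseudo GTs (old classes) with real GTs (new classes)."""
    pseudo_gts = [
        PseudoGT(box=d.box, label=d.label, score=d.score)
        for d in detector.predict(image)  # detector trained on previous tasks
        if d.score > 0.5                  # assumed confidence threshold
    ]
    return verify_pseudo_gts(image, pseudo_gts, vlm) + list(real_gts)
```

The key design point this sketch captures is that verification is training-free: the VLM is used only as a frozen judge of the detector's pseudo labels, so old-class knowledge is retained without replay buffers or distillation losses.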