Osprey: Pixel Understanding with Visual Instruction Tuning
Abstract: Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short of fine-grained vision-language alignment at the pixel level. Moreover, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, aiming at pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-LLM by injecting pixel-level representations into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate Osprey's superiority in various region-understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be seamlessly integrated with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
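The abstract outlines the core mechanism: a mask-aware visual extractor pools pixel-level features from a convolutional CLIP feature map under each binary mask, producing region tokens that are injected into the LLM's input sequence. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the module and tensor names are assumptions, and the paper's actual extractor is more involved (e.g., multi-level features), so consult the official repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAwareExtractor(nn.Module):
    """Hypothetical sketch: mask-pool CLIP features, project to LLM space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Projects pooled region features into the LLM token embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, feat: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feat:  (C, H, W)  convolutional CLIP feature map
        # masks: (N, h, w)  binary masks, one per referred region
        # Resize masks to the feature-map resolution.
        m = F.interpolate(masks[None].float(), size=feat.shape[-2:],
                          mode="nearest")[0]               # (N, H, W)
        # Mask pooling: average the features over each region's pixels.
        denom = m.sum(dim=(1, 2)).clamp(min=1.0)           # (N,)
        pooled = torch.einsum("chw,nhw->nc", feat, m) / denom[:, None]
        return self.proj(pooled)                           # (N, llm_dim)

# Usage: one token per mask, spliced into the instruction sequence wherever
# a region placeholder appears. The masks could come from SAM's automatic
# mask generator, which is how class-agnostic regions would be described.
extractor = MaskAwareExtractor()
feat = torch.randn(1024, 32, 32)        # dummy CLIP feature map
masks = torch.rand(2, 512, 512) > 0.5   # two dummy binary masks
region_tokens = extractor(feat, masks)  # (2, 4096)
```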
- Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
- Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Language models are few-shot learners. In NeurIPS, pages 1877–1901, 2020.
- Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023a.
- Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023b.
- Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, pages 1971–1978, 2014.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Vocabulary-free image classification. In NeurIPS, 2023.
- The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- Open-vocabulary universal image segmentation with maskclip. In ICML, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019.
- Partimagenet: A large, high-quality dataset of parts. In ECCV, pages 128–145. Springer, 2022.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Segment anything in high quality. arXiv preprint arXiv:2306.01567, 2023.
- Segment anything. In ICCV, 2023.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
- Elevater: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS, pages 9287–9301, 2022.
- Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023b.
- Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023c.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023d.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023e.
- Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023f.
- Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
- Visual instruction tuning. In NeurIPS, 2023b.
- A convnet for the 2020s. In CVPR, pages 11976–11986, 2022.
- Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. In ICLR, 2019.
- Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
- Microsoft. DeepSpeed. https://www.deepspeed.ai/, 2023.
- OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- Aims: All-inclusive multi-level segmentation. In NeurIPS, 2023a.
- High quality entity segmentation. In ICCV, pages 4047–4056, 2023b.
- Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- Paco: Parts and attributes of common objects. In CVPR, pages 7141–7151, 2023.
- Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
- Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Hierarchical open-vocabulary universal image segmentation. In NeurIPS, 2023.
- Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
- Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955–2966, 2023.
- mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
- Modeling context in referring expressions. In ECCV, pages 69–85. Springer, 2016.
- Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. In NeurIPS, 2023.
- From recognition to cognition: Visual commonsense reasoning. In CVPR, pages 6720–6731, 2019.
- Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.
- Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Segment everything everywhere all at once. In NeurIPS, 2023.