Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception (2403.02969v2)

Published 5 Mar 2024 in cs.CV

Abstract: Multimodal LLMs (MLLMs) leverage LLMs as a cognitive framework for diverse vision-language tasks. Recent efforts have equipped MLLMs with visual perception and grounding capabilities, but a gap remains in providing fine-grained, pixel-level perception and in extending interactions beyond text-only inputs. In this work, we propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references such as text, boxes, images, or audio. This design gives users greater flexibility to engage with the model beyond textual and regional prompts, without requiring modality-specific designs. Through our proposed refocusing mechanism, the generated grounding output is guided to focus more closely on the referenced object, implicitly incorporating additional pixel-level supervision. This simple modification reuses attention scores produced during LLM inference, eliminating the need for extra computation while improving both the grounding masks and the referring expressions. With only publicly available training data, our model achieves state-of-the-art results across multiple benchmarks, including referring segmentation from diverse modalities and region-level referring expression generation.
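
The refocusing mechanism is only summarized above, so the following is a minimal, hypothetical PyTorch sketch of the general idea: reuse attention scores the LLM already produces (from a grounding token toward the image-patch tokens) as a spatial prior that reweights the visual features passed to the mask decoder. Every name and shape here (refocus, num_patches, the layer/head averaging) is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def refocus(attn_scores: torch.Tensor,   # (layers, heads, num_patches)
            visual_feats: torch.Tensor   # (num_patches, dim)
            ) -> torch.Tensor:
    """Reweight visual features with attention the LLM already computed,
    so no extra forward pass is needed (hypothetical sketch)."""
    # Average the grounding-token attention over layers and heads,
    # giving one relevance score per image patch.
    prior = attn_scores.mean(dim=(0, 1))        # (num_patches,)
    # Normalize the scores into a distribution over patches.
    prior = F.softmax(prior, dim=-1)
    # Scale each patch feature by its prior; multiplying by num_patches
    # keeps the average feature magnitude roughly unchanged.
    return visual_feats * (prior.unsqueeze(-1) * prior.numel())

# Toy usage: a 24x24 patch grid (576 tokens), 256-d features,
# attention from a 32-layer, 32-head LLM.
scores = torch.rand(32, 32, 576)
feats = torch.randn(576, 256)
print(refocus(scores, feats).shape)  # torch.Size([576, 256])
```

In the paper's terms, such a prior would act as the "implicit pixel-level supervision" the abstract mentions: it biases the grounding output toward the referenced object using attention the LLM computes anyway, rather than adding a new module.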

Authors (8)
  1. Junwen He (5 papers)
  2. Yifan Wang (319 papers)
  3. Lijun Wang (51 papers)
  4. Huchuan Lu (199 papers)
  5. Jun-Yan He (27 papers)
  6. Jin-Peng Lan (7 papers)
  7. Bin Luo (209 papers)
  8. Xuansong Xie (69 papers)
Citations (11)