DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

Published 25 May 2024 in cs.CV (arXiv:2405.16071v2)

Abstract: A fundamental task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution-adaptability needs of different tasks, which prevents them from producing precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns the language descriptions of multimodal tasks with images of multiple resolutions, constructed by nesting a set of random views around the referred region. During inference, DynRefer selectively performs multimodal referring by sampling suitable region representations for each task from the nested views based on image and task priors. This allows the visual information used for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement across a broad set of tasks, including region-level captioning, open-vocabulary region recognition, and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is available at https://github.com/callsys/DynRefer.
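
To make the nested-view construction concrete, below is a minimal Python/PIL sketch of one plausible way to build such views around a referred region. It is an illustration under stated assumptions, not code from the DynRefer repository; the `nested_views` function and its parameters are hypothetical.

```python
# A minimal, hypothetical sketch (not the authors' implementation): building a set
# of nested random views around a referred region, as described in the abstract.
import random
from PIL import Image

def nested_views(image: Image.Image, box, num_views=3, out_size=224):
    """Return `num_views` crops nested around `box` = (x1, y1, x2, y2).

    The first view is the referred region itself; each later view randomly
    expands the previous one toward the image borders, so the set covers the
    region with several amounts of surrounding context (effective resolutions).
    """
    x1, y1, x2, y2 = map(float, box)
    W, H = image.size
    views = []
    for i in range(num_views):
        if i > 0:
            r = random.random()            # random expansion ratio in [0, 1)
            x1, y1 = x1 * (1 - r), y1 * (1 - r)
            x2, y2 = x2 + r * (W - x2), y2 + r * (H - y2)
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        views.append(crop.resize((out_size, out_size)))
    return views  # during training, these views would be aligned with the description
```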
