NExT-Chat: An LMM for Chat, Detection and Segmentation (2311.04498v4)

Published 8 Nov 2023 in cs.CV, cs.AI, and cs.CL

Abstract: The development of LLMs has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance the level of visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as a series of text sequences (pix2seq). In this paper, we introduce a novel paradigm for object location modeling called pix2emb method, where we ask the LMM to output the location embeddings and then decode them with different decoders. This paradigm allows us to use different location formats (such as bounding boxes and masks) in multimodal conversations. Leveraging the proposed pix2emb method, we train an LMM named NExT-Chat and demonstrate its capability of handling multiple tasks like visual grounding, region captioning, and grounded reasoning. Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (68.9) vs. LISA (67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on region caption task. The code and model are released at https://github.com/NExT-ChatV/NExT-Chat.

An Expert Analysis of "NExT-Chat: An LMM for Chat, Detection, and Segmentation"

The growing intersection of LLMs and visual comprehension has given rise to large multimodal models (LMMs). A notable contribution in this field is the paper "NExT-Chat: An LMM for Chat, Detection, and Segmentation." The authors introduce a new paradigm for integrating object location modeling into LMMs, the pix2emb method, which marks a significant departure from the earlier pix2seq approach. Whereas pix2seq converts object coordinates into textual sequences for the LMM to consume, pix2emb has the LMM output location embeddings that are then decoded by dedicated decoders, allowing for greater flexibility in location formats such as bounding boxes and segmentation masks.
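
To make the contrast concrete, below is a minimal PyTorch-style sketch of the pix2emb idea: the hidden state produced at a special location token is turned into a box by a small decoder, and boxes can be encoded back into the embedding space as inputs. The hidden size, the two-layer MLPs, and names like BoxDecoder and BoxEncoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

HIDDEN = 4096  # assumed LMM hidden size (hypothetical)


class BoxDecoder(nn.Module):
    """Decodes a location embedding into a normalized (cx, cy, w, h) box."""

    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, 4)
        )

    def forward(self, loc_emb: torch.Tensor) -> torch.Tensor:
        # sigmoid keeps coordinates in [0, 1], relative to the image size
        return self.mlp(loc_emb).sigmoid()


class BoxEncoder(nn.Module):
    """Encodes a normalized box back into the LMM embedding space (for inputs)."""

    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def forward(self, box: torch.Tensor) -> torch.Tensor:
        return self.mlp(box)


# Usage: the hidden state the LMM produces at a location token is decoded into
# a box; the same embedding could instead be routed to a mask decoder (e.g. a
# SAM-style prompt) to produce a segmentation mask.
loc_emb = torch.randn(1, HIDDEN)   # stand-in for the LMM's output hidden state
box = BoxDecoder()(loc_emb)        # shape (1, 4), normalized box
```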

The paper details the development and capabilities of NExT-Chat, an LMM that uses the pix2emb method to handle tasks including visual grounding, region captioning, and grounded reasoning. NExT-Chat shows clear gains over existing models on several benchmarks: an accuracy of 87.7 on POPE-Random, outperforming Shikra at 86.9; an IoU of 68.9 on referring expression segmentation, surpassing LISA's 67.9; and a CIDEr score of 79.6 on RefCOCOg region captioning, well above Kosmos-2's 62.3.

Methodological Innovations

The pix2emb paradigm represents an important methodological shift in how LMMs process and interpret visual data. By encoding object locations as embeddings rather than discrete text tokens, NExT-Chat can handle a wider variety of location-aware tasks. The model sets itself apart by supporting tasks that require fine-grained visual understanding, such as localizing individual objects within an image, rather than treating the image as an undifferentiated whole.

The introduction of <trigger> and <loc> tokens lets the model handle both detection and segmentation within a single framework: it can output location information in multiple formats without losing the contextual richness needed for subsequent language tasks. A cycle loss further strengthens the training of the location encoder and decoder, improving the alignment between the two components.
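
As a rough illustration of this alignment objective, the sketch below pairs a location encoder and decoder with a round-trip L1 penalty. The plain L1 terms and equal weighting are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cycle_loss(box_encoder: nn.Module, box_decoder: nn.Module,
               boxes: torch.Tensor, loc_embs: torch.Tensor) -> torch.Tensor:
    """Round-trip consistency between a location encoder and decoder."""
    # box -> embedding -> box should reconstruct the original box
    loss_box = F.l1_loss(box_decoder(box_encoder(boxes)), boxes)
    # embedding -> box -> embedding should reconstruct the original embedding
    loss_emb = F.l1_loss(box_encoder(box_decoder(loc_embs)), loc_embs)
    return loss_box + loss_emb


# Usage with the BoxEncoder / BoxDecoder sketched earlier (shapes must match):
# loss = cycle_loss(BoxEncoder(), BoxDecoder(), gt_boxes, predicted_loc_embs)
```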

Empirical Evaluation

NExT-Chat was evaluated on several benchmarks and demonstrated strong performance on region-level tasks. On visual grounding, for example, the model handled complex queries, reasoned about object interactions within a scene, and outperformed several state-of-the-art baselines.

Implications and Future Directions

This research opens several avenues for future exploration, particularly in reducing the dependence on extensive datasets for training high-accuracy models. The pix2emb method provides a flexible framework that could lower the resource barriers for training future LMMs. It could also be extended to more complex multimodal tasks involving dynamic visual data such as videos or 3D scenes.

While the paper indicates significant improvements, the authors note limitations regarding the model's capability to handle multiple image inputs simultaneously and its performance across diverse domains such as medical imaging. Addressing these limitations could significantly broaden the applicability of LMMs in real-world tasks beyond traditional visual understanding frameworks.

In conclusion, the introduction of NExT-Chat demonstrates a noteworthy advance in the integration of language and vision tasks, showcasing how LMMs can evolve to address increasingly complex scenarios. The proposed pix2emb method offers a template for future research aiming to enhance the interpretability and contextual understanding within multimodal AI.

References (49)
  1. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  2. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  3. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  4. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  5. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
  6. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer, 2020.
  7. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1769–1779, 2021.
  8. Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16321–16330, 2021.
  9. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020.
  10. MultiModal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  11. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  12. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  13. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
  14. LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
  15. MIMIC-IT: Multi-modal in-context instruction tuning, 2023.
  16. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  17. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  18. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
  19. GRES: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23592–23601, 2023.
  20. HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.
  21. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
  22. Visual instruction tuning, 2023.
  23. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  24. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10034–10043, 2020.
  25. Point and ask: Incorporating pointing into visual question answering, 2020.
  26. Generation and comprehension of unambiguous object descriptions. In Proceedings of CVPR, pages 11–20, 2016.
  27. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  28. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
  29. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  30. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
  31. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  32. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
  33. The All-Seeing Project: Towards panoptic visual recognition and understanding of the open world. arXiv preprint arXiv:2308.01907, 2023.
  34. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11686–11695, 2022.
  35. GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
  36. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pages 521–539. Springer, 2022.
  37. LAVT: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18155–18165, 2022.
  38. PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169, 2022.
  39. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  40. Modeling context in referring expressions. In Proceedings of ECCV, pages 69–85. Springer, 2016.
  41. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
  42. From recognition to cognition: Visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  43. Transfer visual prompt generator across LLMs. arXiv preprint arXiv:2305.01278, 2023.
  44. GPT4RoI: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.
  45. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
  46. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  47. Visual7W: Grounded question answering in images, 2016.
  48. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15116–15127, 2023.
  49. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023.
Authors (5)
  1. Ao Zhang
  2. Yuan Yao
  3. Wei Ji
  4. Zhiyuan Liu
  5. Tat-Seng Chua