SegLLM: Multi-round Reasoning Segmentation (2410.18923v2)

Published 24 Oct 2024 in cs.CV and cs.AI

Abstract: We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.

Summary

  • The paper introduces SegLLM, a model that re-integrates previous-round mask outputs into its input stream to support multi-round conversational segmentation.
  • It employs a mask-encoding scheme and mask-aware decoding tokens that combine mask-level visual features with bounding-box information.
  • Empirical results show performance gains of over 20% on the multi-round MRSeg benchmark compared to state-of-the-art models.

An Analysis of SegLLM: Multi-round Reasoning Segmentation

The paper introduces SegLLM, a framework that enhances image segmentation by using a large multimodal LLM to support conversation-like interaction with the user. It does so through a multi-round reasoning segmentation model that integrates conversational memory into the segmentation process, recalling both visual and textual outputs across multiple interactions. SegLLM represents a significant step in leveraging LLMs to segment complex visual scenes and to handle user queries that go beyond single text prompts.

Core Contributions and Methodology

SegLLM's core contribution is a mask-aware multimodal LLM that re-integrates past segmentation outputs into its input sequence. This enables the model not only to comprehend complex user intentions but also to segment objects in relation to previously identified entities. Among its design novelties, SegLLM features:

  1. Mask-Encoding Scheme: SegLLM re-incorporates reference mask data into the input stream during inference, encoding both the mask's semantic features and its bounding-box coordinates as embeddings that are interleaved with the text input (see the illustrative sketch after this list). This addresses "inter-round" dependencies, where the output of an earlier round informs subsequent requests, and improves multi-round conversational performance.
  2. Mask-Aware Decoding: The model incorporates [REF] and [SEG] tokens to facilitate the generation of new masks that consider both visual outputs and historical contextual interactions, enabling the system to manage complex segmentation queries effectively.
  3. Multi-round Dataset: The curation of a high-quality multi-round interactive segmentation benchmark, MRSeg, is a noteworthy contribution, designed to stress-test segmentation models with positional, interactional, and hierarchical query types across rounds.
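
To make this encode-and-decode loop concrete, the following is a minimal, hypothetical PyTorch sketch. It assumes patch-level visual features and normalized bounding-box coordinates; the class and function names (MaskAwareEncoder, extract_seg_queries), dimensions, and pooling choice are illustrative assumptions, not SegLLM's actual implementation.

```python
# Hedged sketch of the mask-aware feedback loop described above.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class MaskAwareEncoder(nn.Module):
    """Turns a previous-round mask into embeddings the LLM can consume."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.semantic_proj = nn.Linear(vis_dim, llm_dim)  # mask-pooled visual features
        self.box_proj = nn.Linear(4, llm_dim)             # normalized xyxy box coordinates

    def forward(self, image_feats: torch.Tensor, mask: torch.Tensor,
                box_xyxy: torch.Tensor) -> torch.Tensor:
        # image_feats: (HW, vis_dim) patch features; mask: (HW,) binary mask of a past output
        weights = mask.float() / mask.float().sum().clamp(min=1.0)
        pooled = (image_feats * weights.unsqueeze(-1)).sum(dim=0)  # average features inside the mask
        sem_tok = self.semantic_proj(pooled)   # semantic "what was segmented" embedding
        box_tok = self.box_proj(box_xyxy)      # positional "where it was" embedding
        # The two extra mask tokens are spliced into the next round's input sequence.
        return torch.stack([sem_tok, box_tok], dim=0)  # (2, llm_dim)

def extract_seg_queries(hidden_states: torch.Tensor,
                        token_ids: torch.Tensor,
                        seg_token_id: int) -> torch.Tensor:
    """In the decoding direction, the hidden state at each [SEG] position
    conditions a mask decoder; [REF] positions point back at earlier masks."""
    # hidden_states: (seq_len, llm_dim); token_ids: (seq_len,)
    return hidden_states[token_ids == seg_token_id]  # (num_seg_tokens, llm_dim)
```

The design point this illustrates is that each previous-round mask is compressed into a small number of tokens in the LLM's embedding space, so conversational history remains cheap to carry across many rounds.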

Results and Implications

Empirically, SegLLM outperforms existing state-of-the-art methods by more than 20% on the newly curated MRSeg benchmark. It also shows noticeable gains on single-round referring expression segmentation and localization, improving metrics such as cIoU and Acc@0.5. These improvements highlight SegLLM's robustness both in handling multi-turn interactions and in more traditional single-pass segmentation tasks.
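
As a point of reference, the sketch below shows how these two metrics are commonly computed, assuming the standard definitions used in referring-segmentation work (the paper itself does not provide this code): cIoU as cumulative intersection over cumulative union across the evaluation set, and Acc@0.5 as the fraction of predicted boxes whose IoU with the ground truth exceeds 0.5.

```python
# Hedged sketch of the evaluation metrics, assuming standard definitions.
import numpy as np

def ciou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Cumulative IoU: sum all intersections and unions before dividing."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return float(inter) / max(float(union), 1.0)

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / max(union, 1e-9)

def acc_at_05(pred_boxes: list[np.ndarray], gt_boxes: list[np.ndarray]) -> float:
    """Fraction of predictions whose box IoU with the ground truth exceeds 0.5."""
    hits = sum(box_iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```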

Evaluation and Future Directions

A striking observation is the model's stability in multi-turn settings, where other models' performance typically degrades. Actively reusing historical segmentation outputs places SegLLM ahead, offering a scalable approach to the complex, conversational segmentation tasks often encountered in real-world applications.

Looking forward, the SegLLM model opens pathways for further exploration in AI conversational agents, particularly in areas dealing with interactive visual dialogue systems. Future developments may involve extending such frameworks to encompass even more intricate multi-modal integrations beyond image-text pairs, potentially incorporating auditory or sensory data, thus vastly enriching interaction experiences.

Furthermore, while SegLLM successfully integrates context accumulated over sequential interactions, future models could explore dynamic adaptation strategies to maintain performance as the number of interaction rounds grows in practical applications.

Conclusion

SegLLM represents a significant stride from single-round segmentation toward more sophisticated, memory-aware interaction models suited to conversational AI. Its mechanisms for recalling and integrating visual and textual context pave the way for smarter, more context-aware systems that can understand, process, and learn from multi-layered dialogues over time.