
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Published 29 Oct 2025 in cs.CV | (2510.25263v3)

Abstract: We propose LangHOPS, the first Multimodal LLM (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM-driven part query refinement strategy. The code will be released here.

Summary

  • The paper introduces LangHOPS, the first MLLM-based framework for open-vocabulary object-part instance segmentation, which grounds object-part hierarchies in language space rather than in visual grouping.
  • Its hierarchical object-part parser combines language-space hierarchy embedding with MLLM-refined part queries, yielding 5.5% (in-domain) and 4.8% (cross-dataset) AP gains on PartImageNet.
  • The approach improves fine-grained visual understanding and cross-dataset generalization, though its computational cost remains a challenge for real-time applications.

Introduction

The paper "LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation" (2510.25263) introduces LangHOPS, a framework for open-vocabulary object-part instance segmentation built on Multimodal LLMs (MLLMs). The framework addresses the inherent challenge of segmenting object parts at varying granularity, improving both in-domain and cross-dataset segmentation through language-grounded hierarchies.

Figure 1: Given a 2D image and user queries of candidate object-part categories, LangHOPS grounds object-part hierarchies in language space, using an MLLM to segment objects into parts.

Methodology: Framework Overview

LangHOPS is structured around four key components: an image backbone, an object segmentation module, an object-part parser, and a part segmentation module. The object-part parser combines a "Language-Grounded Hierarchies" module, which embeds object-part hierarchies in language space, with an "MLLM-based Parsing" module, which outputs refined part queries suitable for segmentation.

Figure 2: LangHOPS framework illustrating the integration of language-grounded hierarchies and MLLM for object-part parsing.

The framework processes input images to produce masks and categories for the segmented object and part instances, maintaining contextual awareness and granularity adaptation through structured language inputs and MLLM-driven query refinement.
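To make the idea of a language-grounded hierarchy concrete, here is a toy sketch of what such a structure might contain. The categories, the nesting depth, and the `flatten` helper are all invented for illustration; they are not the paper's actual data structures or datasets:

```python
# Hypothetical nested object-part hierarchy, expressed purely in language.
# Each level links a concept to its finer-grained children, so the same
# structure can serve coarse queries ("car's wheel") and fine ones
# ("car's wheel's rim").
hierarchy = {
    "car": {
        "body": ["door", "window", "mirror"],
        "wheel": ["tire", "rim"],
    },
    "cat": {
        "head": ["ear", "eye", "nose"],
        "torso": ["leg", "tail"],
    },
}

def flatten(hierarchy):
    """Enumerate every object-part path as a natural-language phrase."""
    paths = []
    for obj, parts in hierarchy.items():
        for coarse, fine_parts in parts.items():
            paths.append(f"{obj}'s {coarse}")
            paths.extend(f"{obj}'s {coarse}'s {fine}" for fine in fine_parts)
    return paths

print(flatten(hierarchy)[:4])
# → ["car's body", "car's body's door", "car's body's window", "car's body's mirror"]
```

Because the hierarchy lives in language space, linking multi-granularity concepts reduces to string composition that an MLLM can reason over, rather than to visual grouping heuristics.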

Experimental Results

LangHOPS' experimental validation spans multiple challenging scenarios: in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. It achieves state-of-the-art performance, improving average precision (AP) by 5.5% in-domain and 4.8% cross-dataset on PartImageNet, and mIoU by 2.5% on unseen object parts in ADE20K (zero-shot).

Figure 3: Qualitative results of part-level segmentation by LangHOPS compared to baseline methods.

Ablation Studies and Insights

Ablation studies confirm the efficacy of LangHOPS' language-grounded hierarchy and MLLM-based parsing strategy, which significantly improve part-query representation and granularity adaptation. They also highlight a synergy between the object and part segmentation modules, enabled by the integrated language-grounded, MLLM-driven design.

Challenges and Future Directions

Despite its strong performance, LangHOPS incurs substantial computational cost, which limits real-time applications. Future directions include improving computational efficiency and extending the approach to 3D computer vision, particularly in settings that require task-specific adaptation and better dataset generalization.

Figure 4: Open-Vocabulary Part Instance Segmentation via Language-Space Modeling.

Conclusion

LangHOPS sets a new benchmark in open-vocabulary object-part segmentation through its innovative use of language-grounded hierarchies and MLLMs. It presents a compelling approach for fine-grained visual understanding, paving the way for advanced AI applications in diverse and dynamic contexts. The framework exemplifies how integrating structured language understanding into segmentation tasks can yield significant gains in both accuracy and generalizability across varied datasets and scenarios.


Explain it Like I'm 14

What is this paper about?

This paper introduces LangHOPS, a new AI system that can look at a picture, find the objects in it (like a bus or a cat), and then carefully cut out their parts (like wheels, windows, head, tail) even if it has never seen those exact part names before. This job is called “open-vocabulary object–part instance segmentation.” “Open-vocabulary” means it can handle new, user-given labels; “instance” means it separates different copies of the same thing (cat 1 vs. cat 2); and “part segmentation” means it finds the pieces that make up each object.

What were the researchers trying to do?

They had three main goals:

  • Make a system that can find objects and their parts at the same time, not just objects.
  • Handle different “levels” of detail—sometimes you want big parts (like “car body”), other times tiny parts (like “screws”).
  • Work even when the part names are new or come from different datasets, so it can generalize to new situations.

How did they do it?

Think of the system as a team with four steps, using both pictures and language to reason:

  • Step 1: Find the objects
    • The system first detects and outlines each object in the image (for example, “bus 1,” “bus 2”). This is like drawing a neat outline around every bus in a photo.
  • Step 2: Build a “parts map” using language
    • Instead of guessing parts just from image patterns, LangHOPS uses language knowledge to understand which parts belong to which objects—like a dictionary that knows a bus has wheels, windows, and doors, while a cat has a head, body, legs, and tail. This “language space” acts like an idea map that links objects to their parts.
  • Step 3: Ask a multimodal LLM (MLLM) for help
    • A multimodal LLM is an AI that can read text and look at images. LangHOPS gives it the object information and the possible part names (like “bus’s wheel,” “bus’s door”) and asks it to refine these “questions” so they match the actual image. This helps the system choose the right parts and the right level of detail (coarse vs. fine).
  • Step 4: Cut out the parts
    • With the refined “part questions,” the system then segments (cuts out) each part from the image, producing clean masks for “bus 1’s wheel 1,” “bus 1’s wheel 2,” and so on.

Two extra notes in simple terms:

  • “Language-grounded” means the system uses words and descriptions to guide what to look for, not just pixel patterns.
  • “Granularity” means how detailed you want to be—like breaking a LEGO model into big chunks or tiny bricks.
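The four steps above can be sketched as a toy program. Everything here is a stand-in (the real LangHOPS modules are neural networks, and these function names are invented), but it shows how the stages hand data to one another:

```python
# Toy, runnable sketch of the four-step flow. Each function mimics the
# role of a learned module with a trivial rule-based stand-in.
HIERARCHY = {"bus": ["wheel", "window", "door"],
             "cat": ["head", "body", "leg", "tail"]}

def detect_objects(image):
    # Step 1 stand-in: pretend the object segmenter found these instances.
    return [{"category": "bus", "id": 1}, {"category": "cat", "id": 1}]

def build_part_queries(obj):
    # Step 2: look up candidate parts in the language-space "parts map".
    return [f"{obj['category']}'s {part}" for part in HIERARCHY[obj["category"]]]

def refine_queries(image, obj, queries):
    # Step 3 stand-in: an MLLM would drop parts not visible in this image
    # and adjust the level of detail; here it is a no-op.
    return queries

def segment_parts(image, obj, queries):
    # Step 4 stand-in: a mask decoder would produce one mask per query.
    return [{"query": q, "mask": None} for q in queries]

def langhops_sketch(image):
    results = []
    for obj in detect_objects(image):
        queries = refine_queries(image, obj, build_part_queries(obj))
        results.append((obj, segment_parts(image, obj, queries)))
    return results

for obj, parts in langhops_sketch(image=None):
    print(obj["category"], "->", [p["query"] for p in parts])
```

The key design point the sketch preserves: part candidates are generated and refined as language ("bus's wheel"), so step 4 only ever segments parts that make sense for the detected object.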

What did they find?

The team tested LangHOPS on several challenging benchmarks and scenarios:

  • Stronger accuracy than previous methods:
    • On PartImageNet, a benchmark of object parts for vehicles and animals, LangHOPS beat earlier methods by about 5.5% AP when trained and tested on similar data, and by about 4.8% AP when trained on one dataset and tested on a different one.
    • In a "zero-shot" test on ADE20K, where the model must segment parts it never saw during training, LangHOPS improved unseen-part accuracy by about 2.5% mIoU.
  • Better at adapting to different detail levels:
    • When trained with more varied datasets (including ones with different part granularities), LangHOPS gained up to +10% in overall performance on PartImageNet.
  • Parts help objects, too:
    • Training the system to segment parts didn’t just help parts—it also improved object segmentation by about 5.4% on one dataset. In other words, learning parts sharpened the model’s understanding of whole objects.

Overall, the results show that connecting images to language (to understand object–part hierarchies) and using an MLLM for reasoning leads to cleaner, more accurate part segmentation—especially when the labels are new or the detail level changes.
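Since the zero-shot results above are reported in mIoU (mean intersection-over-union), here is that metric in miniature. The masks are toy 4-pixel sets, not real segmentation output:

```python
def iou(pred: set, gt: set) -> float:
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def mean_iou(pairs):
    """mIoU: the average of per-class IoU scores."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

# Toy example: two part classes ("wheel", "door") as pixel sets.
pred_wheel = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt_wheel   = {(0, 0), (0, 1), (1, 0)}           # overlap 3, union 4 → IoU 0.75
pred_door  = {(2, 2), (2, 3), (3, 2), (3, 3)}
gt_door    = {(2, 2), (2, 3), (3, 2), (3, 3)}   # perfect match → IoU 1.0

print(mean_iou([(pred_wheel, gt_wheel), (pred_door, gt_door)]))  # → 0.875
```

The AP numbers reported for instance segmentation also build on mask IoU: a predicted instance counts as a true positive only when its IoU with a ground-truth mask exceeds a threshold.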

Why does this matter?

  • Practical uses:
    • Robotics: A robot needs to know not just “this is a microwave,” but also “this is the handle” or “this is the door hinge” to interact correctly.
    • Image editing and AR: Want to recolor only a car’s wheels or replace a laptop’s screen in a photo? Precise part masks make that easy.
    • Education and search: Systems can better explain what things are made of and help people find exactly the parts they care about.
  • Bigger picture:
    • LangHOPS shows that mixing vision with language (especially using powerful LLMs) helps computers understand scenes in a more human-like, structured way. It can handle new terms and switch between coarse and fine detail on demand.
  • Limitations and next steps:
    • It’s more computationally heavy than some older methods because it uses a big LLM.
    • It mainly trained on common objects and parts, so very specialized tasks might still need fine-tuning.
    • A promising future direction is extending these ideas from 2D images to 3D understanding, which could help in AR, VR, and robotics.

In short, LangHOPS is a step toward smarter vision systems that can understand not just what is in a picture, but how the pieces fit together—much like how people think about “wholes” and “parts.”
