
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation


Abstract

We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and to link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% AP (cross-dataset) on the PartImageNet dataset, and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy. The code will be released.
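To make "grounding object-part hierarchies in language space" more concrete, here is a minimal sketch of how open-vocabulary candidate categories might be turned into text prompts that tie each part to its parent object. It is not the authors' implementation: the `Hierarchy` type, the `part_prompts` function, the example categories, and the "{part} of {object}" template are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical open-vocabulary candidates: object category -> part categories.
# These names are illustrative and not taken from the paper.
Hierarchy = Dict[str, List[str]]


def part_prompts(hierarchy: Hierarchy) -> List[Tuple[str, str, str]]:
    """Return (object, part, prompt) triples that link each part to its parent object in text."""
    triples = []
    for obj, parts in hierarchy.items():
        for part in parts:
            # Assumed prompt template; the paper may phrase its prompts differently.
            triples.append((obj, part, f"{part} of {obj}"))
    return triples


candidates: Hierarchy = {"car": ["wheel", "door"], "dog": ["head", "tail"]}
prompts = part_prompts(candidates)
for obj, part, prompt in prompts:
    print(f"{obj} -> {part}: {prompt}")
```

Because the object-part link lives in the prompt text rather than in a visual grouping rule, new categories can be added by editing the candidate dictionary alone. How LangHOPS actually feeds such hierarchies to the MLLM is described in the paper, not here.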

Authors (6)

Yang Miao

Jan-Nico Zaech

Xi Wang

Fabien Despinoy

Danda Pani Paudel

Luc Van Gool

Submitted

October 29, 2025

arXiv Category

cs.CV


Key Contributions

LangHOPS is the first MLLM-based framework for open-vocabulary object-part instance segmentation. It grounds object-part hierarchies in language space, leveraging MLLM capabilities for rich knowledge and reasoning to link multi-granularity concepts.

Business Value

Enables more sophisticated visual understanding systems that can interpret complex scenes and object compositions based on natural language descriptions, crucial for robotics and advanced AI applications.

Paper Metadata

Innovation Type

novel MLLM-based framework for hierarchical segmentation

Deployment Feasibility

Requires significant computational resources due to the MLLM component, but offers advanced capabilities for complex vision tasks.

Limitations Addressed

Reliance on heuristic or learnable visual grouping in prior approaches, and the inability to handle open-vocabulary and hierarchical object-part relationships effectively.

Performance Gains

Achieves state-of-the-art results, surpassing previous methods by 5.5% AP (in-domain) and 4.8% AP (cross-dataset) on PartImageNet, and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot).

Technical Tags

multimodal large language models (MLLMs), open-vocabulary segmentation, object-part segmentation, hierarchical segmentation, language grounding, instance segmentation, zero-shot learning

Research Topics

multimodal AI, large language models, computer vision, semantic understanding, open-vocabulary recognition

Methods & Architectures

integration of MLLM, language grounding, hierarchical parsing, visual grouping, zero-shot evaluation, Multimodal Large Language Model (MLLM)

Applications & Tasks

robotics, image understanding, content analysis, human-robot interaction, open-vocabulary object-part segmentation, handling hierarchical relationships, grounding visual concepts in language, object-part instance segmentation, zero-shot semantic segmentation

Datasets & Benchmarks

Datasets

PartImageNet, ADE20K

Benchmarks

state-of-the-art results • surpassing previous methods by 5.5% AP (in-domain) on PartImageNet • surpassing previous methods by 4.8% AP (cross-dataset) on PartImageNet • surpassing previous methods by 2.5% mIOU on unseen object parts in ADE20K (zero-shot)

Average Precision (AP), mIOU
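For readers less familiar with the second metric, the sketch below shows one common way to compute mean Intersection over Union for semantic segmentation: the per-class IoU is averaged over the classes that appear in the prediction or the ground truth. The function name `mean_iou` and the flat per-pixel label lists are assumptions made for illustration. Benchmark toolkits can differ in details such as ignored labels or accumulating intersections and unions over the whole dataset, so this is not necessarily the exact protocol used in the paper.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in pred or target (flat per-pixel label lists)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.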

Related Fields

natural language processing, computer vision, robotics, artificial intelligence

Keywords

LangHOPS, MLLM, open-vocabulary, object-part segmentation, hierarchical, language grounding, instance segmentation, zero-shot, PartImageNet, ADE20K, multimodal AI

Academic Context

#multimodal AI, #large language models, #computer vision, #semantic understanding, #open-vocabulary recognition

Commercial Potential

Potential Products

advanced visual understanding systems for robots, AI-powered image annotation tools, multimodal search engines

Target Industries

robotics, autonomous systems, e-commerce, content creation, augmented reality

Use Case Examples

enabling robots to identify and manipulate specific parts of objects • generating detailed scene descriptions with object-part relationships • allowing users to query images based on complex object hierarchies

Competitive Edge

Pioneers the use of MLLMs for open-vocabulary hierarchical part segmentation, offering capabilities beyond traditional vision-language models.

Market Opportunity

Rapidly growing market for multimodal AI and advanced vision systems.

Revenue Models

Licensing of the LangHOPS framework • API access for advanced vision services

Resource Requirements

Compute Needs

Very high for training, high for inference.

Data Requirements

Large datasets with annotated object-part hierarchies and corresponding images.

Deployment Constraints

High computational cost and latency associated with MLLMs.

Scalability

Scalable with significant computational resources and optimized MLLM inference techniques.

Production Readiness

Maturity Level

Research prototype

Time to Market

3-5 years

Patent Potential

High, due to the novel integration of MLLMs for a complex vision task.
