📄 Abstract
Real-world objects are composed of distinctive, object-specific parts.
Identifying these parts is key to performing fine-grained, compositional
reasoning; yet large multimodal models (LMMs) struggle to perform this
seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM
benchmark designed for pixel-level part grounding. We construct PARTONOMY from
existing part datasets and our own rigorously annotated set of images,
encompassing 862 part labels and 534 object labels for evaluation. Unlike
existing datasets that simply ask models to identify generic parts, PARTONOMY
uses specialized concepts (e.g., agricultural airplane), and challenges models
to compare objects' parts, consider part-whole relationships, and justify
textual predictions with visual segmentations. Our experiments demonstrate
significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only
5.9% gIoU), highlighting a critical gap in their part grounding abilities. We
note that existing segmentation-enabled LMMs (segmenting LMMs) have two key
architectural shortcomings: they use special [SEG] tokens not seen during
pretraining, which induces distribution shift, and they discard predicted
segmentations instead of using past predictions to guide future ones. To
address these deficiencies, we train several part-centric LMMs and propose
PLUM, a novel segmenting LMM that uses span tagging instead of segmentation
tokens and that conditions on prior predictions in a feedback loop. We find
that pretrained PLUM outperforms existing segmenting LMMs on reasoning
segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM
finetuned on our proposed Explanatory Part Segmentation task is competitive
with segmenting LMMs trained on significantly more segmentation data. Our work
opens up new avenues towards enabling fine-grained, grounded visual
understanding in LMMs.
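To make the two architectural ideas concrete, below is a minimal, hypothetical PyTorch sketch of what the abstract attributes to PLUM: pooling an ordinary text span instead of reading a dedicated [SEG] token, and conditioning the next mask prediction on the previous one. All names (SpanConditionedMaskDecoder, prior_mask_encoder, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's actual architecture) contrasting two ways a
# segmenting LMM can drive a mask decoder: (a) mean-pooling the hidden states
# of a natural-language span rather than a special [SEG] token, and (b)
# feeding the previous predicted mask back in as conditioning (feedback loop).
import torch
import torch.nn as nn


class SpanConditionedMaskDecoder(nn.Module):
    """Toy mask decoder driven by a pooled text span plus an optional prior mask."""

    def __init__(self, hidden_dim=256, feat_dim=256):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, feat_dim)
        # Encodes the previous predicted mask into the visual feature space so
        # the next prediction can be conditioned on it.
        self.prior_mask_encoder = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, token_states, span, image_feats, prior_mask=None):
        # token_states: (seq_len, hidden_dim) LMM hidden states
        # span: (start, end) token indices of the phrase to ground
        # image_feats: (feat_dim, H, W) visual features
        # prior_mask: (H, W) probabilities from an earlier prediction, or None
        start, end = span
        # (a) Span tagging: pool the hidden states of an ordinary text span
        #     instead of reading a dedicated [SEG] token embedding.
        query = self.query_proj(token_states[start:end].mean(dim=0))

        feats = image_feats
        if prior_mask is not None:
            # (b) Feedback: add an encoding of the previous mask so earlier
            #     predictions guide the current one.
            feats = feats + self.prior_mask_encoder(prior_mask[None, None])[0]

        # Dot-product the span query against each spatial location.
        return torch.einsum("d,dhw->hw", query, feats)


if __name__ == "__main__":
    decoder = SpanConditionedMaskDecoder()
    token_states = torch.randn(32, 256)      # stand-in for LMM hidden states
    image_feats = torch.randn(256, 64, 64)   # stand-in for visual features
    mask1 = decoder(token_states, (5, 8), image_feats)                  # e.g. "fuselage"
    mask2 = decoder(token_states, (12, 15), image_feats,
                    prior_mask=mask1.sigmoid())                         # e.g. "left wing"
    print(mask1.shape, mask2.shape)  # torch.Size([64, 64]) torch.Size([64, 64])
```

The key design point illustrated here is that the grounding query comes from text the model already produces during pretraining, so no out-of-distribution special token is needed, and each mask can reuse information from masks predicted earlier in the same response.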
Authors (10)
Ansel Blume
Jeonghwan Kim
Hyeonjeong Ha
Elen Chatikyan
Xiaomeng Jin
Khanh Duy Nguyen
+4 more
Key Contributions
Introduces PARTONOMY, a benchmark for pixel-level part grounding that targets large multimodal models' (LMMs) difficulty in identifying object parts. PARTONOMY uses specialized concepts and challenges models to compare objects' parts, reason about part-whole relationships, and justify textual predictions with visual segmentations, exposing a critical gap in LMMs' part grounding abilities. Also proposes PLUM, a segmenting LMM that replaces segmentation tokens with span tagging and conditions on prior predictions in a feedback loop.
Business Value
Drives the development of more sophisticated AI systems capable of detailed object understanding, which is crucial for applications like robotics, autonomous driving, and detailed image analysis.