Research Paper | AI researchers, Computer vision engineers, NLP practitioners, Robotics developers

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning


📄 Abstract

Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.

Authors (7)

Ye Liu

Zongyang Ma

Junfu Pu

Zhongang Qi

Yang Wu

Ying Shan

+1 more

Submitted

September 22, 2025

arXiv Category

cs.CV


Key Contributions

Introduces UniPixel, a large multi-modal model that unifies object referring and pixel-level segmentation for enhanced visual reasoning. It enables flexible comprehension of visual prompts and generation of mask-grounded responses, bridging the gap between holistic image understanding and fine-grained perception.
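
The refer, segment, then reason flow described above can be illustrated with a toy sketch. This is not UniPixel's architecture or API: the label map stands in for a learned segmenter, the point prompt stands in for a visual prompt, and every name here (`MaskGroundedResponse`, `mask_from_point`, `answer_area_question`, the `<obj1>` pointer string) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Mask = List[List[bool]]


@dataclass
class MaskGroundedResponse:
    """Hypothetical container: answer text plus the masks it points to."""
    text: str
    masks: Dict[str, Mask]  # pointer name -> binary mask


def mask_from_point(label_map: List[List[int]], row: int, col: int) -> Mask:
    """Toy 'referring' step: binary mask of the object under a point prompt.

    A real model would predict this mask; here we read it from a label map.
    """
    target = label_map[row][col]
    return [[cell == target for cell in line] for line in label_map]


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the mask."""
    return sum(cell for line in mask for cell in line)


def answer_area_question(
    label_map: List[List[int]], point: Tuple[int, int]
) -> MaskGroundedResponse:
    """Toy 'reasoning' step: answer a question using the intermediate mask."""
    mask = mask_from_point(label_map, *point)
    area = mask_area(mask)
    return MaskGroundedResponse(
        text=f"The object at <obj1> covers {area} pixels.",
        masks={"<obj1>": mask},
    )


if __name__ == "__main__":
    # 3x3 image with three labeled regions (0, 1, 2); all values are made up.
    labels = [
        [0, 0, 1],
        [0, 1, 1],
        [2, 2, 1],
    ]
    response = answer_area_question(labels, (0, 2))
    print(response.text)  # prints: The object at <obj1> covers 4 pixels.
```

The point of the sketch is only the shape of the output: the answer text refers to a named pointer, and the pointer resolves to a mask that was produced as an intermediate step before the answer was formed.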

Business Value

Enables more sophisticated visual understanding for applications like autonomous driving, robotics, and content moderation, allowing systems to precisely identify and segment objects based on natural language descriptions.

Paper Metadata

Innovation Type

Architectural

Deployment Feasibility

Feasible for large-scale deployment given the trend towards powerful LMMs, but requires significant computational resources.

Limitations Addressed

Lack of integration between referring and segmentation tasks in LMMs, Limited fine-grained pixel-level understanding capabilities, Inability of previous models to integrate perception with visual reasoning

Technical Tags

multi-modal models, visual reasoning, pixel-level segmentation, object referring, language grounding, mask generation, visual prompts, fine-grained perception, large multi-modal models (LMMs), transformer architecture

Research Topics

Multimodal AI, Computer Vision, Natural Language Processing, Visual Reasoning, Generative Models

Methods & Architectures

UniPixel model, Mask-grounded response generation, Pixel-level alignment, Integration of referring and segmentation, Large Multi-modal Models (LMMs), Transformer-based models

Applications & Tasks

Computer Vision, Natural Language Processing, Robotics, Image Understanding, Integrating pixel-level perception with visual reasoning, Achieving fine-grained pixel-level alignment, Unified object referring and segmentation, Object referring, Image segmentation, Visual question answering, Image captioning, Pixel-level visual reasoning

Related Fields

Computer Vision, Natural Language Processing, Multimodal AI, Deep Learning, Robotics

Keywords

Multimodal AI, Large Language Models, Visual Reasoning, Object Referring, Segmentation, Pixel-level, Language Grounding, Computer Vision, NLP, Transformer, Generative Models, Image Understanding

Academic Context

#Multimodal AI, #Computer Vision, #Natural Language Processing, #Visual Reasoning, #Generative Models

Commercial Potential

Potential Products

Advanced image editing tools, Robotic vision systems, Interactive visual search engines, Content analysis platforms

Target Industries

Technology, Media and Entertainment, Retail, Automotive, Security

Use Case Examples

Precisely segmenting specific objects in an image based on a textual description; Enabling robots to identify and interact with specific items in a scene; Automated content moderation by identifying and localizing specific elements

Competitive Edge

Aims to surpass existing LMMs by integrating fine-grained pixel-level capabilities with broader visual reasoning, offering a more comprehensive understanding than models focused solely on holistic or region-level analysis.

Market Opportunity

The market for advanced AI vision and language understanding is rapidly expanding.

Revenue Models

API access, licensing of the model, specialized AI solutions.

Resource Requirements

Compute Needs

High, typical for large multi-modal models, requiring significant GPU resources for training and inference.

Data Requirements

Requires large-scale datasets with detailed annotations for object referring and segmentation, paired with textual descriptions.

Deployment Constraints

Computational cost, Need for high-quality annotated data

Scalability

Scales with the size of the multi-modal model and the complexity of the visual reasoning tasks.

Production Readiness

Maturity Level

Research

Time to Market

Medium-term, dependent on further development and optimization.
