UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Abstract: Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
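The flow described in the abstract (take a visual prompt, produce a mask on demand, then reason over the stored masks as intermediate pointers) can be illustrated with a toy sketch. This is not UniPixel's implementation: `ObjectMemory`, `box_prompt_to_mask`, `mask_area`, and `answer_larger` are hypothetical names, the "segmentation" simply fills a box, and the "reasoning" is a plain area comparison standing in for the language model.

```python
from dataclasses import dataclass, field

Mask = list[list[int]]  # binary mask as nested lists, 1 = object pixel


@dataclass
class ObjectMemory:
    """Hypothetical store of intermediate masks, keyed by pointer id."""
    masks: dict[int, Mask] = field(default_factory=dict)

    def add(self, mask: Mask) -> int:
        # Pointer ids start at 1 and grow with each stored mask.
        pointer = len(self.masks) + 1
        self.masks[pointer] = mask
        return pointer


def box_prompt_to_mask(height: int, width: int, box: tuple[int, int, int, int]) -> Mask:
    """Toy stand-in for a segmentation head: fill the prompted box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_area(mask: Mask) -> int:
    """Count the object pixels in a binary mask."""
    return sum(sum(row) for row in mask)


def answer_larger(memory: ObjectMemory, a: int, b: int) -> str:
    """Toy reasoning step that conditions on stored masks, not raw pixels."""
    bigger = a if mask_area(memory.masks[a]) >= mask_area(memory.masks[b]) else b
    return f"Object <{bigger}> covers the larger area."


memory = ObjectMemory()
p1 = memory.add(box_prompt_to_mask(4, 6, (0, 0, 2, 2)))  # 2 x 2 region
p2 = memory.add(box_prompt_to_mask(4, 6, (2, 1, 6, 4)))  # 4 x 3 region
print(answer_larger(memory, p1, p2))
```

The point of the sketch is only the ordering: masks are produced first, stored under pointer ids, and the answer refers back to those ids so that the response stays grounded in specific regions.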
Authors (7): Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, +1 more
Submitted: September 22, 2025
arXiv Category: cs.CV

Key Contributions

Introduces UniPixel, a large multi-modal model that unifies object referring and pixel-level segmentation for enhanced visual reasoning. It enables flexible comprehension of visual prompts and generation of mask-grounded responses, bridging the gap between holistic image understanding and fine-grained perception.

Business Value

Supports applications such as autonomous driving, robotics, and content moderation, where a system must identify and segment specific objects from natural language descriptions or visual prompts.