📄 Abstract
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities
across a wide range of vision-language tasks. However, due to the restricted
input resolutions, MLLMs face significant challenges in precisely understanding
and localizing visual details in high-resolution images -- particularly when
dealing with extra-small objects embedded in cluttered contexts. To address
this issue, we propose FineRS, a two-stage MLLM-based reinforcement
learning framework for jointly reasoning and segmenting extremely small objects
within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline
comprising Global Semantic Exploration (GSE) and Localized Perceptual
Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to
generate a textual response and a coarse target region, while LPR refines this
region to produce an accurate bounding box and segmentation mask. To couple the
two stages, we introduce a locate-informed retrospective reward, where LPR's
outputs are used to optimize GSE for more robust coarse region exploration.
Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs
on attribute-level reasoning and pixel-level segmentation on subtle,
small-scale targets in complex high-resolution scenes. Experimental results on
FineRS-4k and public datasets demonstrate that our method consistently
outperforms state-of-the-art MLLM-based approaches on both instruction-guided
segmentation and visual reasoning tasks.
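The coarse-to-fine idea in the abstract can be made concrete with a small sketch. It shows only the geometry: a box predicted inside a cropped coarse region is shifted back into full-image coordinates, and a simple score measures how much of the refined box the coarse region covers. Everything here is an assumption for illustration. The names `Box`, `crop_to_full`, and `containment` are hypothetical, and the abstract does not define the locate-informed retrospective reward, so the containment score is a stand-in for the general idea of scoring a coarse region against the refined output, not the paper's formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Clamp each side at zero so an inverted (empty) box has zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def crop_to_full(local: Box, region: Box) -> Box:
    """Shift a box predicted inside the cropped region back to full-image coordinates."""
    return Box(local.x0 + region.x0, local.y0 + region.y0,
               local.x1 + region.x0, local.y1 + region.y0)


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes; the result is inverted (zero area) if they are disjoint."""
    return Box(max(a.x0, b.x0), max(a.y0, b.y0), min(a.x1, b.x1), min(a.y1, b.y1))


def containment(coarse: Box, refined: Box) -> float:
    """Fraction of the refined box lying inside the coarse region (hypothetical score)."""
    if refined.area() == 0.0:
        return 0.0
    return intersection(coarse, refined).area() / refined.area()


# Hypothetical values: a 512x512 coarse region in a large image,
# and a small box predicted in that region's local coordinates.
region = Box(1000, 2000, 1512, 2512)
local_box = Box(100, 120, 140, 150)
full_box = crop_to_full(local_box, region)
score = containment(region, full_box)
```

In this sketch a refined box that falls entirely inside the coarse region scores 1.0, and a coarse region that misses the target scores 0.0, which is the kind of signal a second stage could feed back to the first.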
Authors (7)
Lu Zhang
Jiazuo Yu
Haomiao Xiong
Ping Hu
Yunzhi Zhuge
Huchuan Lu
+1 more
Submitted
October 24, 2025
Key Contributions
FineRS is a two-stage MLLM-based reinforcement learning framework for jointly reasoning about and segmenting extremely small objects in high-resolution images. Its coarse-to-fine pipeline, Global Semantic Exploration (GSE) followed by Localized Perceptual Refinement (LPR), targets the difficulty MLLMs have in localizing fine visual details under restricted input resolutions.
Business Value
Enhances the capability of AI systems to analyze complex visual data with fine details, crucial for applications like medical diagnosis, quality control, and autonomous systems requiring precise object identification.