Abstract
We present the first unified framework that jointly handles three
operationally heterogeneous saliency tasks, namely Salient Object Detection
(SOD), Co-Salient Object Detection (CoSOD), and Salient Instance Segmentation
(SIS), by casting each as a Chain-of-Thought (CoT) reasoning process in a
Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows
a two-stage paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement
Learning (RL). To improve CoT quality during RL, we propose Confidence-Guided
Policy Optimization (CGPO), a lightweight single-sample algorithm that uses
the discrepancy between reward and model confidence as a per-sample advantage
signal. This design naturally focuses updates on informative responses while
eliminating group sampling, thereby addressing three key limitations of GRPO:
confidence-agnostic learning, signal dilution, and prohibitive computational
overhead. We also introduce an "output-to-reasoning" strategy for constructing
high-fidelity SFT data that ensures logical consistency between reasoning
chains and ground-truth masks. Experiments show our model matches or
outperforms specialized state-of-the-art methods and strong closed-source VLMs
across all three tasks; notably, it achieves an S-measure of 0.899 on CoCA for
CoSOD, surpassing the prior best by 8.0 percentage points, despite using far
less training data.
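The abstract does not give CGPO's exact formulation, but the core idea (the reward-confidence discrepancy acting as a per-sample advantage for a single sampled response) can be sketched as follows. This is an illustrative interpretation, not the authors' implementation; the function names, the assumption that reward and confidence both lie in [0, 1], and the plain policy-gradient surrogate are all ours.

```python
def cgpo_advantage(reward: float, confidence: float) -> float:
    """Per-sample advantage: discrepancy between the external reward for a
    sampled response and the model's own confidence in that response.
    Both values are assumed to lie in [0, 1] (an assumption, not from the paper)."""
    return reward - confidence

def cgpo_loss(log_prob: float, reward: float, confidence: float) -> float:
    """Single-sample policy-gradient surrogate: -advantage * log pi(response).
    A response whose reward disagrees with the model's confidence gets a large
    |advantage| and thus drives the update; a well-calibrated response
    (reward ~ confidence) contributes little. No group of samples is needed,
    unlike GRPO's group-relative baseline."""
    return -cgpo_advantage(reward, confidence) * log_prob

# Toy illustration:
# correct answer the model doubted -> positive advantage, reinforce it
adv_surprise = cgpo_advantage(reward=1.0, confidence=0.3)   # 0.7
# wrong answer the model was sure about -> negative advantage, suppress it
adv_overconf = cgpo_advantage(reward=0.0, confidence=0.9)   # -0.9
```

Because the advantage is computed from a single response, each prompt needs one rollout rather than a sampled group, which is where the claimed efficiency gain over GRPO comes from.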
Authors (6)
Long Li
Shuichen Ji
Ziyang Luo
Nian Liu
Dingwen Zhang
Junwei Han
Submitted
November 1, 2025
Key Contributions
CoT-Saliency introduces a unified framework for heterogeneous saliency tasks (SOD, CoSOD, SIS) by casting them as Chain-of-Thought (CoT) reasoning in VLMs. It proposes Confidence-Guided Policy Optimization (CGPO), which improves reasoning quality during RL while being more efficient than GRPO, and an "output-to-reasoning" strategy that ensures logical consistency in SFT data, yielding high-fidelity reasoning and improved performance across diverse saliency tasks.
Business Value
Enables more robust and interpretable visual understanding systems, particularly for tasks requiring complex reasoning about image content, which can benefit applications in autonomous systems, content analysis, and robotics.