
CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

Abstract

We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, e.g., SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, notably achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
Authors (6)
Long Li
Shuichen Ji
Ziyang Luo
Nian Liu
Dingwen Zhang
Junwei Han
Submitted
November 1, 2025
arXiv Category
cs.CV

Key Contributions

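CoT-Saliency introduces a unified framework for three heterogeneous saliency tasks (SOD, CoSOD, and SIS) by casting each as Chain-of-Thought (CoT) reasoning in a VLM. To improve CoT quality during RL, it proposes Confidence-Guided Policy Optimization (CGPO), a single-sample algorithm that uses the discrepancy between reward and model confidence as a per-sample advantage signal and thereby avoids the group sampling that makes GRPO computationally expensive. An "output-to-reasoning" strategy constructs SFT data whose reasoning is logically consistent with the ground-truth masks. According to the abstract, the resulting model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all three tasks.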
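
To make the "reward minus confidence" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how confidence is computed, so the sketch assumes it is the geometric-mean token probability of the sampled response, and it assumes rewards lie in [0, 1]. The function names (`sequence_confidence`, `cgpo_style_advantage`, `grpo_advantages`) are hypothetical. A standard GRPO-style group-normalized advantage is included only for contrast.

```python
import math
from statistics import mean, pstdev


def sequence_confidence(token_logprobs):
    """Assumed confidence measure: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def cgpo_style_advantage(reward, token_logprobs):
    """Single-sample advantage, assumed here to be reward minus confidence.

    Needs only one sampled response per prompt (no group of rollouts).
    """
    return reward - sequence_confidence(token_logprobs)


def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages: rewards normalized within a group of rollouts.

    Uses the population standard deviation for simplicity.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Confident and correct: reward is close to confidence, so the update is small.
confident_right = cgpo_style_advantage(0.95, [-0.05, -0.05])

# Confident but wrong: large negative advantage, so the update is strong.
confident_wrong = cgpo_style_advantage(0.2, [-0.05, -0.05])

# Unsure but right: large positive advantage.
unsure_right = cgpo_style_advantage(0.9, [-1.2, -1.0])

# A GRPO group whose rollouts all earn the same reward yields no signal.
flat_group = grpo_advantages([0.5, 0.5, 0.5])
```

Under these assumptions, a response whose reward already matches the model's confidence contributes almost nothing, while over-confident errors and under-confident successes dominate the update. That is one way to read the abstract's claim that CGPO "focuses updates on informative responses" without sampling a group per prompt.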

Business Value

Enables more robust and interpretable visual understanding systems, particularly for tasks requiring complex reasoning about image content, which can benefit applications in autonomous systems, content analysis, and robotics.