📄 Abstract
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To address this, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated text and the input image to mitigate hallucinations. Unlike existing methods that focus solely on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
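
For reference, a minimal formulation of the quantity the method calibrates is the conditional pointwise mutual information between a candidate text token and the visual input, given the tokens generated so far. This is the standard definition of conditional PMI; the paper's calibrated objective and its bi-level formulation may differ in detail.

```latex
% Conditional PMI between the next text token y_t and the image v,
% given the already-generated prefix y_{<t} (standard definition;
% the paper's calibrated variant may differ).
\[
\mathrm{C\text{-}PMI}(y_t; v \mid y_{<t})
  = \log \frac{p(y_t \mid v,\, y_{<t})}{p(y_t \mid y_{<t})}
  = \log p(y_t \mid v,\, y_{<t}) - \log p(y_t \mid y_{<t}).
\]
% Maximizing this quantity during decoding prefers tokens whose
% probability increases once the image is conditioned on, which is
% the dependency the abstract says the method strengthens.
```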
Authors (6)
Hao Fang
Changle Zhou
Jiawei Kong
Kuofeng Gao
Bin Chen
Shu-Tao Xia
Key Contributions
Introduced a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy to reduce hallucinations in LVLMs. It adaptively strengthens the mutual dependency between generated text and input images by jointly modeling visual and textual tokens in a bi-level optimization problem.
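
To make the two levels concrete, the sketch below illustrates one plausible realization in PyTorch: an inner step that scores candidate text tokens by an image-conditioned versus text-only log-probability contrast (a common PMI approximation), and an outer step that keeps only the image tokens most relevant to the response generated so far. The `model(input_ids=..., visual_feats=...)` interface, the `visual_feats=None` convention for dropping the image, and the per-token relevance scores are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pmi_calibrated_next_token(model, input_ids, visual_feats, alpha=1.0, top_k=50):
    """Inner level (hypothetical sketch): pick the next text token with a
    C-PMI-style score, i.e. the image-conditioned log-probability plus a
    bonus for tokens whose probability rises once the image is seen.

    The model(input_ids=..., visual_feats=...) interface is assumed for
    illustration; real LVLMs expose visual features differently.
    """
    with torch.no_grad():
        logits_v = model(input_ids=input_ids, visual_feats=visual_feats).logits[:, -1]
        logits_t = model(input_ids=input_ids, visual_feats=None).logits[:, -1]

    logp_v = F.log_softmax(logits_v, dim=-1)   # log p(y_t | v, y_<t)
    logp_t = F.log_softmax(logits_t, dim=-1)   # log p(y_t | y_<t)
    scores = logp_v + alpha * (logp_v - logp_t)

    # Restrict the search to plausible tokens so that low-probability
    # noise is not amplified by the PMI term.
    topk = torch.topk(logp_v, top_k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, scores.gather(-1, topk.indices))
    return masked.argmax(dim=-1, keepdim=True)


def purify_image_tokens(visual_feats, relevance, keep_ratio=0.7):
    """Outer level (hypothetical sketch): keep only the image tokens most
    relevant to the response generated so far.

    visual_feats: (num_img_tokens, dim) image token embeddings
    relevance:    (num_img_tokens,) e.g. cross-attention mass from the
                  generated text onto each image token (assumed available)
    """
    num_keep = max(1, int(keep_ratio * visual_feats.size(0)))
    keep = torch.topk(relevance, num_keep).indices.sort().values
    return visual_feats[keep]
```

Alternating these two steps during generation would approximate the bi-level scheme described above: the text side maximizes relevance to the image, while the image side is refined toward the text produced so far.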
Business Value
Enhances the reliability and trustworthiness of multimodal AI systems, enabling more accurate image captioning, visual question answering, and other vision-language applications.