
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

📄 Abstract

Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet bear little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated text and the input image to mitigate hallucinations. Unlike existing methods that focus solely on text-token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
Authors (6)
Hao Fang
Changle Zhou
Jiawei Kong
Kuofeng Gao
Bin Chen
Shu-Tao Xia
Submitted
May 26, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduced a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy to reduce hallucinations in LVLMs. It adaptively strengthens the mutual dependency between generated text and input images by jointly modeling visual and textual tokens in a bi-level optimization problem.
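To make the core idea concrete, here is a minimal, hypothetical sketch of PMI-calibrated next-token scoring. The paper's full method (bi-level optimization and image-token purification) is not reproduced; this only illustrates the basic quantity C-PMI(y_t; v | y_<t) = log p(y_t | v, y_<t) − log p(y_t | y_<t), where tokens made likelier by the image v are boosted over tokens driven by language priors alone. The function name, the mixing weight `lam`, and the toy logits are assumptions for illustration, not the authors' implementation.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of raw logits.
    m = max(logits)
    s = sum(math.exp(x - m) for x in logits)
    return [x - m - math.log(s) for x in logits]

def cpmi_calibrated_scores(logits_with_image, logits_text_only, lam=1.0):
    """Calibrate next-token log-scores with conditional pointwise mutual
    information (illustrative sketch, not the paper's exact algorithm).

    For each candidate token y_t:
        C-PMI = log p(y_t | image, context) - log p(y_t | context)
    The calibrated score boosts tokens whose probability rises when the
    image is present, penalizing pure language-prior continuations.
    """
    lp_img = log_softmax(logits_with_image)   # p(y_t | image, context)
    lp_txt = log_softmax(logits_text_only)    # p(y_t | context) only
    return [li + lam * (li - lt) for li, lt in zip(lp_img, lp_txt)]

# Toy example: token 0 is image-grounded (likelier with the image),
# token 2 reflects a language prior (likelier without the image).
scores = cpmi_calibrated_scores([3.0, 1.0, 0.5], [0.5, 1.0, 3.0], lam=0.5)
best = scores.index(max(scores))  # the image-grounded token wins
```

In practice the two distributions would come from two forward passes of the same LVLM, one with and one without the visual tokens; the paper goes further by also refining which image tokens feed the image-conditioned pass.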

Business Value

Enhances the reliability and trustworthiness of multimodal AI systems, enabling more accurate image captioning, visual question answering, and other vision-language applications.