
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

📄 Abstract

Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet bear little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated text and the input image to mitigate hallucinations. Unlike existing methods that focus solely on text-token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
Authors (6)
Hao Fang
Changle Zhou
Jiawei Kong
Kuofeng Gao
Bin Chen
Shu-Tao Xia
Submitted
May 26, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduced a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy to reduce hallucinations in LVLMs. It adaptively strengthens the mutual dependency between generated text and input images by jointly modeling visual and textual tokens in a bi-level optimization problem.
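To make the core idea concrete, here is a minimal, hypothetical sketch of PMI-calibrated next-token scoring. The paper's full method (bi-level optimization and image-token purification) is not reproduced; this only illustrates the basic quantity C-PMI(y_t; v | y_<t) = log p(y_t | v, y_<t) − log p(y_t | y_<t), where tokens made likelier by the image v are boosted over tokens driven by language priors alone. The function name, the mixing weight `lam`, and the toy logits are assumptions for illustration, not the authors' implementation.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of raw logits.
    m = max(logits)
    s = sum(math.exp(x - m) for x in logits)
    return [x - m - math.log(s) for x in logits]

def cpmi_calibrated_scores(logits_with_image, logits_text_only, lam=1.0):
    """Calibrate next-token log-scores with conditional pointwise mutual
    information (illustrative sketch, not the paper's exact algorithm).

    For each candidate token y_t:
        C-PMI = log p(y_t | image, context) - log p(y_t | context)
    The calibrated score boosts tokens whose probability rises when the
    image is present, penalizing pure language-prior continuations.
    """
    lp_img = log_softmax(logits_with_image)   # p(y_t | image, context)
    lp_txt = log_softmax(logits_text_only)    # p(y_t | context) only
    return [li + lam * (li - lt) for li, lt in zip(lp_img, lp_txt)]

# Toy example: token 0 is image-grounded (likelier with the image),
# token 2 reflects a language prior (likelier without the image).
scores = cpmi_calibrated_scores([3.0, 1.0, 0.5], [0.5, 1.0, 3.0], lam=0.5)
best = scores.index(max(scores))  # the image-grounded token wins
```

In practice the two distributions would come from two forward passes of the same LVLM, one with and one without the visual tokens; the paper goes further by also refining which image tokens feed the image-conditioned pass.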

Business Value

Enhances the reliability and trustworthiness of multimodal AI systems, enabling more accurate image captioning, visual question answering, and other vision-language applications.