Abstract
While multi-modal large language models (MLLMs) have made significant
progress in recent years, the issue of hallucinations remains a major
challenge. To mitigate this phenomenon, existing solutions either introduce
additional data for further training or incorporate external or internal
information during inference. However, these approaches inevitably introduce
extra computational costs. In this paper, we observe that hallucinations in
MLLMs are strongly associated with insufficient attention allocated to visual
tokens. In particular, the presence of redundant visual tokens disperses the
model's attention, preventing it from focusing on the most informative ones. As
a result, critical visual cues are often under-attended, which in turn
exacerbates the occurrence of hallucinations. Building on this observation, we
propose PruneHal, a training-free, simple yet effective method that
leverages adaptive KV cache pruning to enhance the model's focus on critical
visual information, thereby mitigating hallucinations. To the best of our
knowledge, we are the first to apply token pruning for hallucination mitigation
in MLLMs. Notably, our method requires no additional training and incurs
nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be
seamlessly integrated with different decoding strategies, including those
specifically designed for hallucination mitigation. We evaluate PruneHal on
several widely used hallucination evaluation benchmarks using four mainstream
MLLMs, achieving robust and outstanding results that highlight the
effectiveness and superiority of our method. Our code will be publicly
available.
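To make the core idea concrete, the sketch below shows one plausible way attention-guided visual-token pruning of a KV cache could look. This is a minimal illustration under assumptions, not the paper's implementation: the function `prune_visual_kv_cache`, the fixed `keep_ratio` hyperparameter, and the head-averaged attention scoring are all hypothetical, and PruneHal's actual criterion is adaptive rather than a fixed ratio.

```python
import torch

def prune_visual_kv_cache(keys, values, attn_weights, visual_slice, keep_ratio=0.5):
    """Hypothetical sketch of attention-guided KV cache pruning for visual tokens.

    keys, values:  (num_heads, seq_len, head_dim) cached K/V for one layer.
    attn_weights:  (num_heads, seq_len) attention mass each cached token
                   received from the current query step.
    visual_slice:  slice covering the visual-token positions in the cache.
    keep_ratio:    fraction of visual tokens to retain (assumed hyperparameter;
                   the paper's method adapts this rather than fixing it).
    """
    seq_len = keys.shape[1]
    # Score each cached token by its head-averaged attention mass.
    scores = attn_weights.mean(dim=0)  # (seq_len,)
    vis_start, vis_stop = visual_slice.start, visual_slice.stop
    vis_scores = scores[vis_start:vis_stop]
    num_keep = max(1, int(keep_ratio * (vis_stop - vis_start)))
    # Keep only the top-scoring visual tokens; all text tokens are kept,
    # so the model's attention concentrates on the informative visual cues.
    top = torch.topk(vis_scores, num_keep).indices + vis_start
    keep = torch.cat([
        torch.arange(0, vis_start),
        top.sort().values,
        torch.arange(vis_stop, seq_len),
    ])
    return keys[:, keep, :], values[:, keep, :]

# Toy usage: 4 heads, 16 cached tokens (positions 2..12 are visual), dim 8.
if __name__ == "__main__":
    k = torch.randn(4, 16, 8)
    v = torch.randn(4, 16, 8)
    attn = torch.rand(4, 16)
    k2, v2 = prune_visual_kv_cache(k, v, attn, slice(2, 12), keep_ratio=0.4)
    print(k2.shape, v2.shape)  # torch.Size([4, 10, 8]) for both
```

Because pruning only shrinks the existing KV cache between decoding steps, a scheme like this adds essentially no inference overhead, which is consistent with the abstract's claim of nearly cost-free hallucination mitigation.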
Authors (8)
Fengyuan Sun
Hui Chen
Xinhao Xu
Dandan Zheng
Jingdong Chen
Jun Zhou
+2 more
Submitted
October 22, 2025
Key Contributions
Proposes PruneHal, a training-free method that uses adaptive KV cache pruning to reduce hallucinations in Multi-modal Large Language Models (MLLMs). It addresses the issue of attention dispersion caused by redundant visual tokens, which leads to critical visual cues being under-attended. This method enhances the model's focus on informative visual inputs without requiring additional training or external information during inference.
Business Value
Improves the trustworthiness and accuracy of MLLMs, making them more suitable for real-world applications where factual correctness is essential, such as content creation, image analysis, and assistive technologies.