📄 Abstract
Multimodal large language models (MLLMs) exhibit a pronounced preference for
textual inputs when processing vision-language data, limiting their ability to
reason effectively from visual evidence. Unlike prior studies that attribute
this text bias to external factors such as data imbalance or instruction
tuning, we propose that the bias originates from the model's internal
architecture. Specifically, we hypothesize that visual key vectors (Visual
Keys) are out-of-distribution (OOD) relative to the text key space learned
during language-only pretraining. Consequently, these visual keys receive
systematically lower similarity scores during attention computation, leading to
their under-utilization in the context representation. To validate this
hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their
distributional structures using qualitative (t-SNE) and quantitative
(Jensen-Shannon divergence) methods. The results provide direct evidence that
visual and textual keys occupy markedly distinct subspaces within the attention
space. The inter-modal divergence is statistically significant, exceeding
intra-modal variation by several orders of magnitude. These findings reveal
that text bias arises from an intrinsic misalignment within the attention key
space rather than solely from external data factors.
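The sketch below illustrates the kind of key-space analysis the abstract describes, under explicit assumptions that are not taken from the paper: a Hugging Face LLaVA checkpoint (llava-hf/llava-1.5-7b-hf), forward hooks on the language model's k_proj modules to capture key vectors, and a per-dimension histogram proxy for the Jensen-Shannon divergence. Treat it as a minimal illustration of the method, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code: capture attention key vectors from
# a LLaVA-style model with forward hooks, then compare visual vs. textual keys
# with t-SNE (qualitative) and a histogram-based Jensen-Shannon proxy
# (quantitative). Checkpoint name, module paths, and prompt template are
# assumptions that depend on the transformers version.
import numpy as np
import torch
from PIL import Image
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import TSNE
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

captured = {}  # module name -> key-projection output from the last forward pass

def save_keys(name):
    def hook(module, inputs, output):
        captured[name] = output.detach().float().cpu()[0]  # (seq_len, kv_dim)
    return hook

# Hook only the language model's key projections (the vision tower also exposes
# modules named "self_attn.k_proj", which we skip here).
for name, module in model.named_modules():
    if "language_model" in name and name.endswith("self_attn.k_proj"):
        module.register_forward_hook(save_keys(name))

image = Image.open("example.jpg")  # any test image
prompt = "USER: <image>\nDescribe the image. ASSISTANT:"  # LLaVA-style template
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
with torch.no_grad():
    model(**inputs)

# Split positions into visual vs. textual tokens. This assumes a transformers
# version whose processor expands the <image> placeholder so that input_ids
# align position-wise with the language model's sequence.
is_visual = (inputs["input_ids"][0] == model.config.image_token_index).cpu().numpy()

keys = list(captured.values())[len(captured) // 2].numpy()  # one mid-stack layer
vis_keys, txt_keys = keys[is_visual], keys[~is_visual]

# Qualitative view: joint t-SNE embedding of both key populations (plot the two
# halves of `emb` in different colors to visualize the modality separation).
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    np.concatenate([vis_keys, txt_keys])
)

# Quantitative proxy: per-dimension histograms, averaged Jensen-Shannon divergence.
def js_divergence(a, b, bins=64):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-12, q + 1e-12
    return jensenshannon(p / p.sum(), q / q.sum()) ** 2  # squared distance = divergence

inter_modal = np.mean(
    [js_divergence(vis_keys[:, d], txt_keys[:, d]) for d in range(keys.shape[1])]
)
print(f"mean per-dimension JSD, visual vs. textual keys: {inter_modal:.4f}")
```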
Authors (4)
Xinhan Zheng
Huyu Wu
Xueting Wang
Haiyun Jiang
Submitted
October 30, 2025
Key Contributions
This paper argues that the text bias of MLLMs originates in the model's internal architecture: visual key vectors are out-of-distribution relative to the text key space learned during language-only pretraining. Through attention key-space analysis of LLaVA and Qwen2.5-VL, it provides direct evidence that visual and textual keys occupy markedly distinct subspaces, so visual keys receive systematically lower similarity scores and are under-utilized in the context representation.
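To make the under-utilization claim concrete, the follow-up sketch below (reusing model, inputs, and is_visual from the previous sketch) measures how much attention mass textual query positions assign to visual keys. It relies on the standard output_attentions=True option in transformers and is an illustrative proxy, not the paper's protocol; names and thresholds are assumptions.

```python
# Hypothetical follow-up sketch (reuses model, inputs, is_visual from above):
# share of attention mass that textual query positions place on visual keys.
# Loading the model with attn_implementation="eager" may be required for
# output_attentions=True to return full attention maps.
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

vis_mask = torch.as_tensor(is_visual)
visual_share = []
for layer_attn in out.attentions:                   # (batch, heads, q_len, k_len)
    attn = layer_attn[0].float().mean(dim=0).cpu()  # head-averaged (q_len, k_len)
    text_query_rows = attn[~vis_mask]               # attention rows of text queries
    visual_share.append(text_query_rows[:, vis_mask].sum(-1).mean().item())

print("attention mass on visual keys per layer:",
      [f"{s:.3f}" for s in visual_share])
# Note: causal masking means text positions before the image cannot attend to it,
# so this is a rough proxy. Under the paper's hypothesis the share falls well
# below the visual tokens' fraction of the sequence, i.e. visual keys are
# systematically under-attended.
```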
Business Value
Supports the development of more balanced and capable multimodal AI systems. By understanding and mitigating text bias, MLLMs can make better use of visual information, yielding more accurate and reliable applications such as image captioning and visual question answering.