
Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis

📄 Abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
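
To make the analysis pipeline concrete, below is a minimal sketch of one way to reproduce it with HuggingFace Transformers. The checkpoint name (`llava-hf/llava-1.5-7b-hf`), the hooked layer, and the per-dimension histogram JSD are illustrative assumptions rather than the authors' released code; the sketch further assumes the processor expands the image placeholder so that `config.image_token_index` marks visual positions in `input_ids`.

```python
import numpy as np
import torch
from PIL import Image
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import TSNE
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

captured = {}

def save_keys(module, module_inputs, output):
    # k_proj output: (batch, seq_len, num_kv_heads * head_dim)
    captured["keys"] = output.detach().float().cpu()[0]

# Hook the key projection of one mid-depth decoder layer. The module path
# and layer index are assumptions; they vary across transformers versions.
layer = model.language_model.model.layers[16].self_attn.k_proj
handle = layer.register_forward_hook(save_keys)

image = Image.open("example.jpg")  # any test image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)
handle.remove()

# Split keys by modality, assuming visual positions in input_ids carry
# config.image_token_index after placeholder expansion.
keys = captured["keys"].numpy()  # (seq_len, d_k)
is_visual = (inputs["input_ids"][0] == model.config.image_token_index).cpu().numpy()
vis_keys, txt_keys = keys[is_visual], keys[~is_visual]

# Qualitative view: shared 2-D t-SNE embedding of visual and text keys.
emb = TSNE(n_components=2, perplexity=30).fit_transform(np.vstack([vis_keys, txt_keys]))
print("t-SNE embedding:", emb.shape)

# Quantitative view: mean per-dimension Jensen-Shannon divergence between
# histogrammed key values (one plausible reading of the paper's JSD metric).
def mean_jsd(a, b, bins=64):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    divs = []
    for d in range(a.shape[1]):
        pa, _ = np.histogram(a[:, d], bins=bins, range=(lo, hi))
        pb, _ = np.histogram(b[:, d], bins=bins, range=(lo, hi))
        # scipy returns the JS *distance*; square it for the divergence.
        divs.append(jensenshannon(pa + 1e-12, pb + 1e-12) ** 2)
    return float(np.mean(divs))

print("inter-modal JSD:", mean_jsd(vis_keys, txt_keys))
```
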
Authors: Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang
Submitted: October 30, 2025
arXiv Category: cs.AI

Key Contributions

This paper proposes that text bias in MLLMs originates in the model's internal architecture: visual key vectors are out-of-distribution relative to the text key space learned during language-only pretraining. Attention key-space analysis on LLaVA and Qwen2.5-VL provides direct evidence that visual and textual keys occupy markedly distinct subspaces, which explains why visual keys receive systematically lower similarity scores and are under-utilized in the context representation.
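
The under-utilization claim can be probed directly. The sketch below, which reuses `model` and `inputs` from the sketch above, compares the average attention mass the final text query assigns to visual versus textual key positions; the layer/head averaging and last-query choice are illustrative assumptions, not the paper's protocol.

```python
# Illustrative probe: does a text query attend less, per token, to visual keys?
# Some attention backends do not return weights; loading the model with
# attn_implementation="eager" may be required for output_attentions=True.
import torch

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

attn = torch.stack(out.attentions).float()  # (layers, batch, heads, q_len, k_len)
is_visual = inputs["input_ids"][0] == model.config.image_token_index

# Attention from the final text query position, averaged over layers and heads.
last_query = attn[:, 0, :, -1, :].mean(dim=(0, 1))  # (k_len,)
per_visual = last_query[is_visual].mean()    # mean mass per visual key
per_text = last_query[~is_visual].mean()     # mean mass per text key
print(f"mean attention per visual key: {per_visual:.3e}")
print(f"mean attention per text key:   {per_text:.3e}")
```
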

Business Value

Supports the development of more balanced and capable multimodal AI systems. Understanding and mitigating text bias lets MLLMs make better use of visual information, leading to more accurate and reliable applications such as image captioning and visual question answering.