Abstract: Due to the unidirectional masking mechanism, decoder-only models propagate
information from left to right. Large Vision-Language Models (LVLMs) follow the
same architecture, with visual information gradually integrated into semantic
representations during forward propagation. Through systematic analysis, we
observe that the majority of the visual information is absorbed into the
semantic representations. However, the model's attention distribution does not
place sufficient emphasis on these semantic representations. This misalignment
between the attention distribution and the actual information flow undermines
the model's visual understanding ability and contributes to hallucinations. To
address this issue, we enhance the model's visual understanding by leveraging
the core information embedded in semantic representations. Specifically, we
identify attention heads that focus on core semantic representations based on
their attention distributions. Then, through a two-stage optimization paradigm,
we propagate the advantages of these attention heads across the entire model,
aligning the attention distribution with the actual information flow. We
evaluate our method on three image captioning benchmarks using five different
LVLMs, demonstrating its effectiveness in significantly reducing
hallucinations. Further experiments reveal a trade-off between reduced
hallucinations and richer details. Notably, our method allows for manual
adjustment of the model's conservativeness, enabling flexible control to meet
diverse real-world requirements.
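
The sketch below illustrates, under stated assumptions, the head-selection idea described in the abstract: scoring each attention head by how much attention mass it places on a designated set of core semantic token positions and picking the most focused heads. The function name, the way the semantic positions are marked, and the top-k selection are illustrative assumptions; the paper's actual scoring criterion and its two-stage optimization paradigm are not reproduced here.

```python
# Hypothetical sketch (not the paper's implementation): score attention heads
# by the attention mass they assign to "core semantic" token positions.
import torch


def score_heads_by_semantic_focus(attn_weights: torch.Tensor,
                                  semantic_mask: torch.Tensor) -> torch.Tensor:
    """
    attn_weights: (num_heads, seq_len, seq_len) attention matrix for one layer;
                  rows are query positions, columns are key positions.
    semantic_mask: (seq_len,) boolean mask marking the token positions treated
                   as core semantic representations (an assumption here).
    Returns a (num_heads,) score: the average attention mass each head assigns
    to the masked key positions.
    """
    # Sum attention over the semantic key positions for every query,
    # then average over queries to get one scalar per head.
    mass_on_semantic = attn_weights[:, :, semantic_mask].sum(dim=-1)  # (heads, seq)
    return mass_on_semantic.mean(dim=-1)                              # (heads,)


if __name__ == "__main__":
    num_heads, seq_len = 8, 16
    # Dummy attention: softmax over random logits so each row sums to 1.
    attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)
    # Assume, for illustration, the last 4 positions carry the core semantics.
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[-4:] = True

    scores = score_heads_by_semantic_focus(attn, mask)
    top_heads = torch.topk(scores, k=2).indices
    print("per-head semantic-focus scores:", scores.tolist())
    print("heads most focused on semantic tokens:", top_heads.tolist())
```

In the paper's framing, heads selected this way would serve as the reference whose behavior the two-stage optimization then propagates to the rest of the model; that propagation step is beyond what this sketch covers.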