Research Paper. Audience: ML Researchers, NLP Researchers, Computer Vision Researchers, Developers of MLLM applications

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Abstract

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 to 6.5× less compute.
Authors (7)
Liangyu Zhong
Fabio Rosenthal
Joachim Sicking
Fabian Hüger
Thorsten Bagdonat
Hanno Gottschalk
+1 more
Submitted
June 26, 2025
arXiv Category
cs.CV

Key Contributions

FOCUS is a training-free visual cropping method for fine-grained VQA. It uses internal MLLM representations (the KV cache) to build an object relevance map, then proposes and ranks candidate image regions from that map instead of searching exhaustively. This avoids task-specific fine-tuning and uninformed search: FOCUS outperforms three popular visual cropping methods in accuracy and efficiency, and matches the best-performing baseline, ZoomEye, with 3 to 6.5× less compute.
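
The middle two steps of the method (computing a relevance map, then proposing and ranking regions) can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity scoring, the sliding-window proposals, and the names `relevance_map` and `rank_regions` are assumptions made for illustration, and the toy arrays stand in for keys that a real MLLM would hold in its KV cache. Identifying the target object in the prompt and running VQA on the selected crop are left out.

```python
import numpy as np

def relevance_map(image_keys, target_query, grid_hw):
    """Cosine similarity between each image-token key and a target-object vector.

    Hypothetical scoring rule; the paper derives its map from the KV cache,
    but the exact formula is not given in this summary.
    """
    h, w = grid_hw
    keys = image_keys / (np.linalg.norm(image_keys, axis=1, keepdims=True) + 1e-8)
    query = target_query / (np.linalg.norm(target_query) + 1e-8)
    return (keys @ query).reshape(h, w)

def rank_regions(rel_map, window_hw, stride=1, top_k=3):
    """Slide a window over the map and return the top_k regions by mean relevance.

    Each result is (score, (top, left, height, width)) in token-grid units.
    """
    h, w = rel_map.shape
    wh, ww = window_hw
    scored = []
    for top in range(0, h - wh + 1, stride):
        for left in range(0, w - ww + 1, stride):
            score = float(rel_map[top:top + wh, left:left + ww].mean())
            scored.append((score, (top, left, wh, ww)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy data: a 6x6 grid of image tokens with 16-dimensional keys.
rng = np.random.default_rng(0)
grid_hw, dim = (6, 6), 16
target_query = rng.normal(size=dim)
image_keys = rng.normal(size=(grid_hw[0] * grid_hw[1], dim))

# Plant a small "object": tokens in rows 3-4, columns 1-2 align with the query.
for r in (3, 4):
    for c in (1, 2):
        image_keys[r * grid_hw[1] + c] = 5.0 * target_query

rel = relevance_map(image_keys, target_query, grid_hw)
best_score, best_box = rank_regions(rel, window_hw=(2, 2))[0]
print(best_box)  # (3, 1, 2, 2): the window covering the planted tokens
```

In a full pipeline the winning box would be mapped from token-grid units back to pixel coordinates, and the cropped region passed to the MLLM together with the original question.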

Business Value

More efficient and accurate fine-grained VQA could benefit applications that depend on detailed image understanding, such as automated quality control, medical diagnosis assistance, or enhanced search functionality.