Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and
reasoning capabilities for image-text input, Visual Question Answering (VQA)
that targets small image details remains a challenge. Although visual
cropping techniques seem promising, recent approaches suffer from several
limitations: they require task-specific fine-tuning, are inefficient due to
uninformed exhaustive search, or are incompatible with efficient attention implementations.
We address these shortcomings by proposing a training-free visual cropping
method, dubbed FOCUS, that leverages MLLM-internal representations to guide the
search for the most relevant image region. This is accomplished in four steps:
first, we identify the target object(s) in the VQA prompt; second, we compute
an object relevance map using the key-value (KV) cache; third, we propose and
rank relevant image regions based on the map; and finally, we perform the
fine-grained VQA task using the top-ranked region. As a result of this informed
search strategy, FOCUS achieves strong performance across four fine-grained VQA
datasets and three types of MLLMs. It outperforms three popular visual cropping
methods in both accuracy and efficiency, and matches the best-performing
baseline, ZoomEye, while requiring 3-6.5× less compute.
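
The following is a minimal sketch of how the four-step pipeline described above could be wired together. It is not the authors' implementation: all helper names (extract_target_objects, relevance_from_kv, answer_with_mllm) are hypothetical stand-ins for the MLLM-dependent steps, and the KV-cache relevance map is replaced with random dummy scores so the snippet runs standalone.

```python
# Hypothetical sketch of the four-step FOCUS pipeline; the MLLM-dependent
# steps are stubbed out so the script executes without a model.
import numpy as np

def extract_target_objects(prompt: str) -> list[str]:
    """Step 1 (placeholder): identify the target object(s) in the VQA prompt."""
    # FOCUS derives this from the prompt; here we crudely take the last word.
    return [prompt.split()[-1].strip("?")]

def relevance_from_kv(objects: list[str], grid: int = 24) -> np.ndarray:
    """Step 2 (placeholder): object relevance map over image patches.
    FOCUS computes this from the MLLM's key-value (KV) cache; random
    scores of the same shape stand in so the sketch is runnable."""
    rng = np.random.default_rng(0)
    return rng.random((grid, grid))

def propose_and_rank_regions(rel_map: np.ndarray, win: int = 8):
    """Step 3: slide a window over the map and rank regions by mean relevance."""
    g = rel_map.shape[0]
    regions = []
    for y in range(0, g - win + 1, win // 2):
        for x in range(0, g - win + 1, win // 2):
            score = rel_map[y:y + win, x:x + win].mean()
            regions.append(((y, x, win), score))
    return sorted(regions, key=lambda r: r[1], reverse=True)

def answer_with_mllm(region, prompt: str) -> str:
    """Step 4 (placeholder): crop to the top-ranked region and re-query the MLLM."""
    return f"<answer for crop {region} to: {prompt!r}>"

if __name__ == "__main__":
    prompt = "What is written on the small sign?"
    objs = extract_target_objects(prompt)             # step 1
    rel = relevance_from_kv(objs)                     # step 2
    top_region, _ = propose_and_rank_regions(rel)[0]  # step 3
    print(answer_with_mllm(top_region, prompt))       # step 4
```

The key design point the sketch illustrates is that region proposal and ranking (step 3) operate on an internally derived relevance map rather than an exhaustive crop-and-query search, which is where the claimed efficiency gain over search-based baselines comes from.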
Authors (7)
Liangyu Zhong
Fabio Rosenthal
Joachim Sicking
Fabian Hüger
Thorsten Bagdonat
Hanno Gottschalk
+1 more
Key Contributions
FOCUS is a training-free method for efficient fine-grained VQA that leverages internal MLLM representations (the KV cache) to guide visual cropping. This overcomes the limitations of task-specific fine-tuning and exhaustive search, improving both efficiency and accuracy on detailed visual queries.
Business Value
Enables more efficient and accurate AI systems for tasks requiring detailed image understanding, such as automated quality control, medical diagnosis assistance, or enhanced search functionalities.