Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and
reasoning capabilities for image-text input, Visual Question Answering (VQA)
that targets small image details remains a challenge. Although visual
cropping techniques seem promising, recent approaches suffer from several
limitations: they require task-specific fine-tuning, are inefficient due to
uninformed exhaustive search, or are incompatible with efficient attention implementations.
We address these shortcomings by proposing a training-free visual cropping
method, dubbed FOCUS, that leverages MLLM-internal representations to guide the
search for the most relevant image region. This is accomplished in four steps:
first, we identify the target object(s) in the VQA prompt; second, we compute
an object relevance map using the key-value (KV) cache; third, we propose and
rank relevant image regions based on the map; and finally, we perform the
fine-grained VQA task using the top-ranked region. As a result of this informed
search strategy, FOCUS achieves strong performance across four fine-grained VQA
datasets and three types of MLLMs. It outperforms three popular visual cropping
methods in both accuracy and efficiency, and matches the best-performing
baseline, ZoomEye, while requiring 3-6.5× less compute.
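
The following is a minimal sketch of how the four-step pipeline described above could be wired together. It is not the authors' implementation: all helper names (extract_target_objects, relevance_from_kv, answer_with_mllm) are hypothetical stand-ins for the MLLM-dependent steps, and the KV-cache relevance map is replaced with random dummy scores so the snippet runs standalone.

```python
# Hypothetical sketch of the four-step FOCUS pipeline; the MLLM-dependent
# steps are stubbed out so the script executes without a model.
import numpy as np

def extract_target_objects(prompt: str) -> list[str]:
    """Step 1 (placeholder): identify the target object(s) in the VQA prompt."""
    # FOCUS derives this from the prompt; here we crudely take the last word.
    return [prompt.split()[-1].strip("?")]

def relevance_from_kv(objects: list[str], grid: int = 24) -> np.ndarray:
    """Step 2 (placeholder): object relevance map over image patches.
    FOCUS computes this from the MLLM's key-value (KV) cache; random
    scores of the same shape stand in so the sketch is runnable."""
    rng = np.random.default_rng(0)
    return rng.random((grid, grid))

def propose_and_rank_regions(rel_map: np.ndarray, win: int = 8):
    """Step 3: slide a window over the map and rank regions by mean relevance."""
    g = rel_map.shape[0]
    regions = []
    for y in range(0, g - win + 1, win // 2):
        for x in range(0, g - win + 1, win // 2):
            score = rel_map[y:y + win, x:x + win].mean()
            regions.append(((y, x, win), score))
    return sorted(regions, key=lambda r: r[1], reverse=True)

def answer_with_mllm(region, prompt: str) -> str:
    """Step 4 (placeholder): crop to the top-ranked region and re-query the MLLM."""
    return f"<answer for crop {region} to: {prompt!r}>"

if __name__ == "__main__":
    prompt = "What is written on the small sign?"
    objs = extract_target_objects(prompt)             # step 1
    rel = relevance_from_kv(objs)                     # step 2
    top_region, _ = propose_and_rank_regions(rel)[0]  # step 3
    print(answer_with_mllm(top_region, prompt))       # step 4
```

The key design point the sketch illustrates is that region proposal and ranking (step 3) operate on an internally derived relevance map rather than an exhaustive crop-and-query search, which is where the claimed efficiency gain over search-based baselines comes from.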
Authors (7)
Liangyu Zhong
Fabio Rosenthal
Joachim Sicking
Fabian Hüger
Thorsten Bagdonat
Hanno Gottschalk
+1 more
Key Contributions
FOCUS is a training-free method for efficient fine-grained VQA that leverages internal MLLM representations (the KV cache) to guide visual cropping. This overcomes the limitations of task-specific fine-tuning and exhaustive search, improving both efficiency and accuracy on detailed visual queries.
Business Value
Enables more efficient and accurate AI systems for tasks requiring detailed image understanding, such as automated quality control, medical diagnosis assistance, or enhanced search functionalities.