Abstract
Recently, Multimodal Large Language Models (MLLMs) have made rapid progress,
particularly in enhancing their reasoning capabilities. However, existing
reasoning benchmarks still primarily assess language-based reasoning, often
treating visual input as replaceable context. To address this gap, we introduce
BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging
perceptual tasks. Instead of relying on external knowledge, our tasks require
models to reason from visual content alone, shifting the focus from
language-based to image-grounded reasoning. Compared to prior perception
benchmarks, it moves beyond shallow perception ("see") and requires
fine-grained observation and analytical reasoning ("observe"). BLINK-Twice
integrates three core components: seven types of visual challenges for testing
visual reasoning, natural adversarial image pairs that enforce reliance on
visual content, and annotated reasoning chains for fine-grained evaluation of
the reasoning process rather than final answers alone. We evaluate 20 leading
MLLMs, including 12 foundation models and 8 reasoning-enhanced models.
BLINK-Twice poses a significant challenge to current models. While existing
reasoning strategies in the language space, such as chain-of-thought or
self-criticism, can improve performance, they often result in unstable and
redundant reasoning. We observe that repeated image observation improves
performance across models, and that active visual interaction, as demonstrated
by models like o3, highlights the need for a new paradigm for visual reasoning.
The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice.
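As a rough illustration of how such a benchmark might be consumed, the sketch below loads annotated samples and scores a model's final answers. The annotation layout, the field names ("image", "question", "answer"), and the model.answer call are assumptions for illustration only, not the official BLINK-Twice interface; consult the repository above for the actual data format and evaluation protocol.

```python
# Minimal evaluation sketch for a BLINK-Twice-style benchmark.
# NOTE: the JSON layout and field names here are hypothetical; the real
# dataset format is documented at https://github.com/PicoTrex/BLINK-Twice.
import json
from pathlib import Path


def load_samples(annotation_file: str) -> list[dict]:
    """Load benchmark samples from a JSON annotation file (assumed layout)."""
    with open(annotation_file, "r", encoding="utf-8") as f:
        return json.load(f)


def evaluate(model, samples: list[dict], image_root: str) -> float:
    """Score exact-match accuracy on final answers.

    Annotated reasoning chains could additionally be compared step by step
    to grade the reasoning process, not just the final answer.
    """
    correct = 0
    for sample in samples:
        image_path = Path(image_root) / sample["image"]
        # `model.answer` stands in for whatever MLLM inference call you use.
        prediction = model.answer(image=image_path, question=sample["question"])
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct += 1
    return correct / len(samples)
```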
Key Contributions
Introduces BLINK-Twice, a novel vision-centric reasoning benchmark for Multimodal Large Language Models (MLLMs) that shifts the focus from language-based to image-grounded reasoning. It addresses the gap left by existing benchmarks, which treat visual input as secondary, and demands fine-grained observation beyond shallow perception.
Business Value
Provides a standardized and rigorous way to evaluate the true visual reasoning capabilities of MLLMs, which is crucial for developing more reliable and capable AI systems in areas such as robotics and autonomous driving.