
SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Abstract

Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. We propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions (32-128 pixels); strikingly, this unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over the low-level visual operations that are crucial for real-world robustness. Our work urges a shift toward hybrid models that integrate multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
Authors: Sifan Li, Yujun Cai, Yiwei Wang
Submitted: June 3, 2025
arXiv Category: cs.CL

Key Contributions

This paper introduces HC-Bench, a benchmark of 112 images containing hidden text, objects, and illusions, on which leading VLMs achieve near-zero accuracy (0-5.36%) even with explicit prompting. It proposes SemVink, a simple method that scales images down to low resolutions (32-128 pixels) and raises accuracy above 99%. The result highlights a critical flaw: VLMs favor abstract reasoning and neglect low-level visual operations.
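The downscaling step can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which resampling filter or image library the paper uses, so the box-average filter, the grayscale nested-list image type, and the names `target_size`, `downscale`, and `max_side` are all assumptions made for illustration. The idea it shows is that averaging blocks of pixels removes fine-grained detail, leaving only the coarse structure at a 32-128 pixel scale.

```python
from typing import List, Tuple

# Assumption: a grayscale image stored as rows of pixel intensities.
Image = List[List[float]]


def target_size(width: int, height: int, max_side: int = 64) -> Tuple[int, int]:
    """Shrink (width, height) so the longer side is at most max_side,
    preserving the aspect ratio. Images already small enough are unchanged."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / max(width, height))  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale(img: Image, max_side: int = 64) -> Image:
    """Box-average downscale: each output pixel is the mean of the
    block of source pixels it covers."""
    src_h, src_w = len(img), len(img[0])
    dst_w, dst_h = target_size(src_w, src_h, max_side)
    out: Image = []
    for y in range(dst_h):
        # Source rows covered by output row y (at least one row).
        y0 = y * src_h // dst_h
        y1 = max(y0 + 1, (y + 1) * src_h // dst_h)
        row = []
        for x in range(dst_w):
            # Source columns covered by output column x (at least one column).
            x0 = x * src_w // dst_w
            x1 = max(x0 + 1, (x + 1) * src_w // dst_w)
            block = [img[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


if __name__ == "__main__":
    # A fine checkerboard stands in for high-frequency detail.
    checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
    print(target_size(1024, 512, max_side=128))
    print(downscale(checker, max_side=2))
```

In practice one would more likely resize with an image library before sending the image to a VLM; the pure-Python version is used here only so the sketch has no dependencies.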

Business Value

Improves the reliability and robustness of visual AI systems in applications that require nuanced image interpretation beyond simple object recognition, such as the medical imaging and security uses the authors cite.