📄 Abstract
Vision-language models (VLMs) excel at semantic tasks but falter at a core
human capability: detecting hidden content in optical illusions or AI-generated
images through perceptual adjustments like zooming. We introduce HC-Bench, a
benchmark of 112 images with hidden text, objects, and illusions, revealing
that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit
prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to
an overreliance on high-level semantics. Strikingly, simply scaling images to
low resolutions (32-128 pixels), a method we propose as SemVink (Semantic
Visual Thinking), unlocks >99% accuracy by eliminating redundant visual noise.
This exposes a critical architectural flaw: VLMs prioritize abstract reasoning
over low-level visual operations crucial for real-world robustness. Our work
urges a shift toward hybrid models integrating multi-scale processing, bridging
the gap between computational vision and human cognition for applications in
medical imaging, security, and beyond.
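The core of SemVink, as described in the abstract, is just a preprocessing step: rescale the image to a very low resolution (32-128 pixels) before passing it to the VLM. The sketch below illustrates one way to do that with block-average (area) pooling in NumPy; the function name and the choice of resampling filter are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def semvink_downscale(image: np.ndarray, target: int = 64) -> np.ndarray:
    """Downscale an H x W (x C) image to roughly target x target by
    averaging non-overlapping pixel blocks (area interpolation).

    Illustrative sketch only: the paper reports rescaling to 32-128 px;
    the exact resampling method is an assumption here.
    """
    h, w = image.shape[:2]
    # integer block sizes; clamp to 1 so small images pass through
    fh, fw = max(h // target, 1), max(w // target, 1)
    # crop so each dimension divides evenly into blocks
    image = image[: (h // fh) * fh, : (w // fw) * fw].astype(float)
    new_h, new_w = image.shape[0] // fh, image.shape[1] // fw
    # split into (new_h, fh, new_w, fw, ...) blocks and average each block
    blocks = image.reshape(new_h, fh, new_w, fw, *image.shape[2:])
    return blocks.mean(axis=(1, 3))
```

In practice, the downscaled array would be re-encoded and sent to the VLM with the same prompt (e.g. "what hidden text do you see?"), the idea being that discarding high-frequency detail removes the visual noise that masks the hidden content.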
Authors (3)
Sifan Li
Yujun Cai
Yiwei Wang
Key Contributions
This paper introduces HC-Bench, a benchmark for evaluating VLMs' semantic understanding of optical illusions, revealing near-zero accuracy. It proposes SemVink, a simple yet effective method of scaling images to low resolutions (32-128 pixels) that dramatically improves accuracy (>99%), highlighting a critical flaw in VLMs' over-reliance on abstract reasoning over low-level visual operations.
Business Value
Enhances the reliability and robustness of visual AI systems, enabling applications that require nuanced image interpretation beyond simple object recognition.