📄 Abstract
Humans can naturally identify, reason about, and explain anomalies in their
environment. In computer vision, this long-standing challenge remains limited
to industrial defects or unrealistic, synthetically generated anomalies,
failing to capture the richness and unpredictability of real-world anomalies.
In this work, we introduce CAVE, the first benchmark of real-world visual
anomalies. CAVE supports three open-ended tasks: anomaly description,
explanation, and justification, with fine-grained annotations for visual
grounding and categorizing anomalies based on their visual manifestations,
their complexity, severity, and commonness. These annotations draw inspiration
from cognitive science research on how humans identify and resolve anomalies,
providing a comprehensive framework for evaluating Vision-Language Models
(VLMs) in detecting and understanding anomalies. We show that state-of-the-art
VLMs struggle with visual anomaly perception and commonsense reasoning, even
with advanced prompting strategies. By offering a realistic and cognitively
grounded benchmark, CAVE serves as a valuable resource for advancing research
in anomaly detection and commonsense reasoning in VLMs.
Authors (6)
Rishika Bhagwatkar
Syrielle Montariol
Angelika Romanou
Beatriz Borges
Irina Rish
Antoine Bosselut
Submitted
October 29, 2025
2025 Conference on Empirical Methods in Natural Language Processing
Key Contributions
CAVE is presented as the first benchmark of real-world visual anomalies. It supports three open-ended tasks: anomaly description, explanation, and justification. Its fine-grained annotations, inspired by cognitive science, cover visual grounding and categorize anomalies by visual manifestation, complexity, severity, and commonness. Evaluated on CAVE, state-of-the-art Vision-Language Models (VLMs) struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies.
Business Value
A realistic anomaly benchmark can support the development of AI systems that recognize and respond to unexpected situations in real-world environments, a capability that matters for safety and reliability.