Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Abstract

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
Authors (5)
Pei Peng
MingKun Xie
Hang Hao
Tong Jin
ShengJun Huang
Submitted
October 30, 2025
arXiv Category
cs.CV

Key Contributions

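Recasts object-context shortcuts in vision-language models (VLMs) as a causal inference problem and proposes representation-level counterfactual calibration. The method synthesizes counterfactual embeddings by recombining object features with diverse alternative contexts (sampled from external datasets, batch neighbors, or text-derived descriptions) and subtracts background-only activation, improving zero-shot recognition without retraining or prompt design.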
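
The recombination and subtraction steps summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not give the formulas, so the additive recombination of object and context embeddings, the function names, and the `alpha` weight are all assumptions made for illustration. The embeddings are assumed to be precomputed by a CLIP-style encoder; toy 3-dimensional vectors stand in for them here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def naive_scores(image_emb, text_embs):
    """Plain zero-shot scores: cosine similarity of one image to each class text."""
    return l2_normalize(image_emb) @ l2_normalize(text_embs).T

def calibrated_scores(obj_emb, bg_emb, alt_bg_embs, text_embs, alpha=1.0):
    """Hypothetical counterfactual calibration (illustrative assumptions only).

    obj_emb:     (d,)   estimated object embedding
    bg_emb:      (d,)   estimated background embedding of the test image
    alt_bg_embs: (k, d) alternative context embeddings
    text_embs:   (c, d) class text embeddings
    alpha:       assumed weight on the background-only term
    """
    text = l2_normalize(text_embs)
    # Assumption: recombine by adding the object embedding to each alternative context.
    counterfactuals = l2_normalize(obj_emb[None, :] + alt_bg_embs)  # (k, d)
    # Average class scores over the counterfactual contexts.
    mean_cf = (counterfactuals @ text.T).mean(axis=0)               # (c,)
    # Scores produced by the background alone, subtracted as the shortcut term.
    bg_only = l2_normalize(bg_emb) @ text.T                         # (c,)
    return mean_cf - alpha * bg_only

# Toy data: class 0 is the true object, class 1 is suggested by the background.
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
obj_emb = np.array([1.0, 0.2, 0.0])
bg_emb = np.array([0.0, 2.0, 0.0])
alt_bg_embs = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])

print("naive:     ", naive_scores(obj_emb + bg_emb, text_embs))
print("calibrated:", calibrated_scores(obj_emb, bg_emb, alt_bg_embs, text_embs))
```

In this toy setup the naive scores favor the class suggested by the background, while the calibrated scores favor the object class. With a real model, the object and background embeddings would first have to be estimated, a step this sketch does not cover.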

Business Value

Enhances the reliability and fairness of AI systems that use vision-language understanding, making them more trustworthy in real-world scenarios where contexts can vary.