Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether
their failures on abstract reasoning tasks, such as Bongard problems, stem from
flawed perception or faulty top-down reasoning. To disentangle these factors,
we introduce a diagnostic framework centered on the Linear Separability Ceiling
(LSC), the performance achievable by a linear classifier on a VLM's raw visual
embeddings. Applying this framework to state-of-the-art VLMs, we uncover a
pervasive "alignment gap": most models' generative answers fail to outperform
the linear separability of their own representations. We find that the few
models surpassing this ceiling do so via two mechanisms: by further refining
visual representations into a more linearly separable format or by executing
non-linear decision logic. We demonstrate that this bottleneck is not a
fundamental limitation but a solvable alignment issue. By augmenting standard
next-token prediction with a contrastive objective, our fine-tuning method
activates dormant reasoning pathways, systematically improving the linear
structure of representations to significantly surpass the LSC.
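
To make the LSC concrete, here is a minimal sketch of how such a ceiling could be estimated: a cross-validated linear probe fit on frozen visual embeddings, whose accuracy is then compared against the VLM's generative accuracy. The inputs (`embeddings`, `labels`) are hypothetical precomputed arrays, not artifacts released with the paper.

```python
# Minimal sketch of estimating a Linear Separability Ceiling (LSC):
# fit a linear probe on frozen visual embeddings and report its accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_separability_ceiling(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a linear classifier on raw visual embeddings.

    embeddings: (n_samples, d) frozen VLM visual features
    labels:     (n_samples,) binary task labels (e.g. Bongard side A vs. B)
    """
    probe = LogisticRegression(max_iter=1000)
    # Cross-validated accuracy of the linear probe serves as the ceiling estimate.
    scores = cross_val_score(probe, embeddings, labels, cv=5, scoring="accuracy")
    return float(scores.mean())

# Usage with hypothetical precomputed features:
# lsc = linear_separability_ceiling(visual_feats, task_labels)
# alignment_gap = lsc - generative_accuracy  # positive => model underuses its own features
```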
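
The fine-tuning objective described at the end of the abstract can likewise be sketched as a weighted sum of the standard next-token cross-entropy and a contrastive term over pooled visual representations. The pooling choice, function names, and weight `lam` below are illustrative assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch: next-token prediction augmented with a supervised
# contrastive (InfoNCE-style) term that pulls same-label representations together.
import torch
import torch.nn.functional as F

def contrastive_loss(reprs: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over pooled visual representations."""
    z = F.normalize(reprs, dim=-1)                       # (B, d) unit-norm features
    sim = z @ z.t() / tau                                # (B, B) scaled cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float("-inf"))   # drop self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Mean log-probability of positive pairs per anchor (anchors without positives contribute 0).
    pos_counts = pos_mask.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts).mean()

def combined_loss(lm_logits, target_ids, pooled_repr, labels, lam: float = 0.5):
    """Next-token cross-entropy augmented with the contrastive alignment term."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), target_ids.view(-1))
    return ce + lam * contrastive_loss(pooled_repr, labels)
```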