
SCOPE: Selective Cross-modal Orchestration of Visual Perception Experts

📄 Abstract

Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
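
The abstract does not give the form of the dual entropy regularization, so the following is only a minimal sketch of one plausible reading, not the authors' loss. It assumes the instance-level term is the mean entropy of each sample's routing distribution (minimized, so the router commits to one encoder) and the load term is the entropy of the batch-averaged routing distribution (maximized, so encoders are used evenly, with the batch standing in for the dataset). The function name `dual_entropy_loss` and both weights are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dual_entropy_loss(batch_probs, instance_weight=1.0, load_weight=1.0):
    """Hypothetical regularizer: low per-sample entropy, high batch-level entropy.

    batch_probs: list of routing distributions, one per image-text pair.
    """
    n = len(batch_probs)
    k = len(batch_probs[0])
    # Instance-level term: average uncertainty of individual routing decisions.
    instance_entropy = sum(entropy(p) for p in batch_probs) / n
    # Load term: entropy of the average routing distribution over the batch.
    mean_probs = [sum(p[j] for p in batch_probs) / n for j in range(k)]
    load_entropy = entropy(mean_probs)
    # Minimizing this rewards confident decisions and balanced encoder usage.
    return instance_weight * instance_entropy - load_weight * load_entropy

# Three toy batches over two routed encoders (assumed values for illustration).
balanced_confident = dual_entropy_loss([[1.0, 0.0], [0.0, 1.0]])
collapsed = dual_entropy_loss([[1.0, 0.0], [1.0, 0.0]])
uncertain = dual_entropy_loss([[0.5, 0.5], [0.5, 0.5]])
```

Under these assumptions, the batch that routes confidently and spreads load across both encoders receives the lowest loss, while a router that collapses onto one encoder or one that never commits is penalized relative to it.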
Authors (8)
Tianyu Zhang
Suyuchen Wang
Chao Wang
Juan Rodriguez
Ahmed Masry
Xiangru Jian
+2 more
Submitted
October 14, 2025
arXiv Category
cs.CV

Key Contributions

SCOPE proposes a Mixture-of-Encoders (MoEnc) framework with instance-level routing for VLMs: a lightweight router selects one specialized vision encoder per image-text pair from a pool of routed encoders, alongside a shared encoder. With one shared and one routed encoder, it outperforms models that run all four extra encoders simultaneously while reducing compute by 24-49%, demonstrating that intelligent selection beats brute-force aggregation.
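
Instance-level routing as described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's router: text tokens act as queries that cross-attend over the shared encoder's visual features, the attended features are mean-pooled, and a hypothetical linear layer (`router_weights`) scores each routed encoder. One encoder index is chosen for the whole image-text pair, in contrast to token-level MoE routing. All names, shapes, and values here are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(text_tokens, visual_tokens):
    """Each text token (query) attends over shared visual features (keys and values)."""
    dim = len(visual_tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for q in text_tokens:
        weights = softmax([dot(q, k) / scale for k in visual_tokens])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, visual_tokens)) for i in range(dim)]
        )
    return attended

def route(text_tokens, visual_tokens, router_weights):
    """Return (index of the selected routed encoder, routing probabilities)."""
    attended = cross_attend(text_tokens, visual_tokens)
    dim = len(attended[0])
    # Mean-pool over text positions to get one vector per image-text pair.
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]
    # One logit per routed encoder (hypothetical linear scoring layer).
    probs = softmax([dot(w, pooled) for w in router_weights])
    choice = max(range(len(probs)), key=probs.__getitem__)
    return choice, probs

# Toy inputs: one text token, two visual tokens, three routed encoders.
text_tokens = [[1.0, 0.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
choice, probs = route(text_tokens, visual_tokens, router_weights)
```

Only the encoder at index `choice` would then be run in addition to the shared encoder, which is where the compute saving over running every encoder comes from.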

Business Value

Because only one routed encoder runs per input alongside the shared encoder, multi-encoder VLMs can be served at lower compute cost, making advanced multimodal AI applications more accessible and cost-effective for businesses.