
Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Abstract

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
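The text-guided cross-attention step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the text embedding acts as the query and the class tokens from intermediate layers act as both keys and values, and it omits the learned projection matrices a real model would use. The names `text_guided_pool`, `class_tokens`, `coarse_query`, and `fine_query` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_pool(text_query, class_tokens):
    """Single-query scaled dot-product attention.

    Assumption: the text embedding is the query, and the class tokens
    taken from intermediate Transformer layers serve as keys and values.
    """
    d = len(text_query)
    scores = [
        sum(q * k for q, k in zip(text_query, tok)) / math.sqrt(d)
        for tok in class_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the class tokens gives one visual feature.
    pooled = [
        sum(w * tok[i] for w, tok in zip(weights, class_tokens))
        for i in range(d)
    ]
    return pooled, weights

# Toy class tokens from three layers (shallow, middle, deep).
class_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical text cues at two levels of a label hierarchy.
coarse_query = [4.0, 0.0, 0.0]
fine_query = [0.0, 0.0, 4.0]

coarse_feat, coarse_w = text_guided_pool(coarse_query, class_tokens)
fine_feat, fine_w = text_guided_pool(fine_query, class_tokens)
```

In this toy setup, different text cues put most of their attention weight on different layers, so one image yields one visual feature per level of the hierarchy rather than a single feature.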
Authors (7)
Wu Wei
Xiaomeng Fan
Yuwei Wu
Zhi Gao
Pengxiang Li
Yunde Jia
+1 more
Submitted: October 31, 2025
arXiv Category: cs.CV

Key Contributions

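This paper proposes Alignment across Trees, a modality alignment method for vision-language models (VLMs) that constructs and aligns tree-like hierarchical features for both images and text. It introduces a semantic-aware visual feature extraction framework that applies text-guided cross-attention to visual class tokens from intermediate Transformer layers, and embeds the two feature trees into hyperbolic manifolds with distinct curvatures. The heterogeneous manifolds are aligned through a learned intermediate manifold, obtained by minimizing a KL distance measure between distributions on the manifolds; the authors prove that the optimal intermediate manifold exists and is unique.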
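To see why manifolds with distinct curvatures cannot be compared directly, the following sketch computes geodesic distance in the Poincaré ball model of hyperbolic space with curvature -c. The Poincaré ball is one standard model and is assumed here; the summary does not say which model the paper uses. The sketch does not implement the paper's KL distance measure or its intermediate manifold. The function name `poincare_distance` and the toy points are hypothetical.

```python
import math

def poincare_distance(x, y, c):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    d_c(x, y) = (1 / sqrt(c)) * arcosh(
        1 + 2c * ||x - y||^2 / ((1 - c||x||^2) * (1 - c||y||^2)))
    """
    if c <= 0:
        raise ValueError("c must be positive")
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    if c * nx >= 1 or c * ny >= 1:
        raise ValueError("points must lie inside the ball of radius 1/sqrt(c)")
    diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1 + 2 * c * diff / ((1 - c * nx) * (1 - c * ny))
    # Clamp guards against rounding slightly below 1.
    return math.acosh(max(1.0, arg)) / math.sqrt(c)

# The same pair of coordinates (say, a root node and a child node)
# measured under two different curvatures.
root = [0.0, 0.0]
child = [0.5, 0.0]
d_image = poincare_distance(root, child, c=1.0)  # assumed image-side curvature
d_text = poincare_distance(root, child, c=0.5)   # assumed text-side curvature
```

The same coordinates yield different distances under the two curvatures, which suggests why aligning trees embedded in heterogeneous manifolds needs a dedicated distance measure rather than a direct coordinate comparison.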

Business Value

By aligning visual and textual hierarchies more symmetrically, the method could support finer-grained understanding between visual and textual data in applications such as AI assistants, content analysis tools, and search engines.