Research Paper · Target audience: machine learning researchers, information retrieval specialists, developers of search engines and content management systems

RzenEmbed: Towards Comprehensive Multimodal Retrieval

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework that learns embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss with two key enhancements. First, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Second, we implement an approach that mitigates the impact of false negatives and alleviates data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
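The abstract does not spell out the loss formulas, but the ingredients it names (in-batch hardness weighting, false-negative mitigation, a learnable temperature) map onto a standard InfoNCE skeleton. The PyTorch sketch below is one plausible reading, not the paper's implementation: the weighting rule, the `hardness_beta` knob, and the false-negative test are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedInfoNCE(nn.Module):
    """Sketch of an InfoNCE variant with in-batch hardness weighting,
    false-negative masking, and a CLIP-style learnable temperature.
    The exact RzenEmbed formulation is not given in the abstract; the
    weighting and masking rules below are assumptions, not the paper's."""

    def __init__(self, init_temp: float = 0.07, hardness_beta: float = 1.0):
        super().__init__()
        # Learnable temperature, parameterized as a log inverse temperature.
        self.log_inv_temp = nn.Parameter(torch.log(torch.tensor(1.0 / init_temp)))
        self.hardness_beta = hardness_beta  # hypothetical weighting sharpness

    def forward(self, query: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # query, target: (N, d); target row i is the positive for query i.
        q = F.normalize(query, dim=-1)
        t = F.normalize(target, dim=-1)
        sim = q @ t.T                                   # (N, N) cosine similarities
        logits = sim * self.log_inv_temp.exp()

        n = sim.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=sim.device)
        pos = sim.diag().unsqueeze(1)                   # (N, 1) positive similarity

        # Assumed false-negative rule: an in-batch negative that scores above
        # its row's positive is likely a mislabeled match, so drop it.
        false_neg = (sim > pos) & ~eye

        # Assumed hardness rule: up-weight harder (higher-similarity)
        # negatives; log-weights fold directly into the logits.
        log_w = self.hardness_beta * sim.detach()
        neg_logits = (logits + log_w).masked_fill(eye | false_neg, float("-inf"))

        # Weighted softmax: each positive against its reweighted negative set.
        pos_logits = logits.diag().unsqueeze(1)
        denom = torch.logsumexp(torch.cat([pos_logits, neg_logits], dim=1), dim=1)
        return (denom - pos_logits.squeeze(1)).mean()
```

Folding the log-weights into the logits keeps the weighted softmax numerically stable, since the whole denominator goes through a single `logsumexp`.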
Authors (7)
Weijian Jian
Yajun Zhang
Dawei Liang
Chunyu Xie
Yixiao He
Dawei Leng
+1 more
Submitted
October 31, 2025
arXiv Category
cs.CV

Key Contributions

RzenEmbed introduces a unified framework for learning embeddings across diverse modalities (text, images, videos, visual documents), extending CLIP-based approaches beyond natural images. It employs a novel two-stage training strategy with an enhanced InfoNCE loss that incorporates hardness weighting and false-negative mitigation to learn discriminative representations.
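The abstract also credits a learnable temperature and model souping for the final performance boost. "Model souping" conventionally means weight-space averaging of several fine-tuned checkpoints (Wortsman et al., 2022); the abstract does not say whether a uniform or greedy soup is used, or which checkpoints enter it. A minimal uniform-soup sketch, with hypothetical checkpoint paths:

```python
import torch

def uniform_soup(state_dicts: list[dict]) -> dict:
    """Average the parameters of several fine-tuned checkpoints
    ("uniform model soup"). Assumes all checkpoints share the same
    architecture and key set; non-float buffers are naively averaged
    here for simplicity."""
    soup = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        soup[key] = stacked.mean(dim=0)
    return soup

# Usage (hypothetical checkpoint paths):
# sds = [torch.load(p, map_location="cpu") for p in ["ckpt_a.pt", "ckpt_b.pt"]]
# model.load_state_dict(uniform_soup(sds))
```

Souping adds no inference cost, since the averaged weights load into a single model of the original architecture.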

Business Value

Enables more comprehensive and accurate search and retrieval across different types of digital content, improving user experience and efficiency in managing large, diverse datasets.