arxiv_cv · Benchmark and Dataset · AI researchers, Developers of multimodal systems, Engineers working on wearable technology

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

📄 Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearable scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
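
The truthfulness figures quoted above can be made concrete with a small scoring sketch. The abstract does not define the metric, so the code below assumes a CRAG-style rule in which each answer is judged correct (+1), missing (0, for example "I don't know"), or hallucinated (-1), and truthfulness is the mean score. The labels, weights, and function name are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter

# Assumed per-answer scores (CRAG-style); not taken from the abstract.
SCORES = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}


def truthfulness(judgements):
    """Return the mean score over a list of per-answer judgement labels."""
    if not judgements:
        raise ValueError("need at least one judgement")
    counts = Counter(judgements)
    unknown = set(counts) - set(SCORES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(SCORES[label] * n for label, n in counts.items())
    return total / len(judgements)


# Hypothetical toy run: 5 correct, 3 missing, 2 hallucinated answers.
example = ["correct"] * 5 + ["missing"] * 3 + ["hallucinated"] * 2
print(truthfulness(example))  # (5 - 2) / 10 = 0.3
```

Under this assumed rule, a system that answers every question but hallucinates often can score below one that abstains when it is unsure, which is one plausible reason the reported truthfulness numbers are low.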

Authors (41)

Jiaqi Wang

Xiao Yang

Kai Sun

Parth Suresh

Sanat Sharma

Adam Czyzewski

+35 more

Submitted

October 30, 2025

arXiv Category

cs.CV

Key Contributions

CRAG-MM introduces the first comprehensive benchmark for Multi-Modal Retrieval-Augmented Generation (MM-RAG) designed specifically for wearable device scenarios. It comprises 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images, and covers challenges such as five types of image-quality issues and varying information dynamism.
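
As a rough illustration of what a benchmark sample of this kind might carry, the sketch below models an image with one or more question/answer turns as Python dataclasses and splits samples into single-turn and multi-turn subsets. All class names, field names, and example values are hypothetical; this summary does not describe the released schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One question/answer exchange (hypothetical schema)."""
    question: str
    answer: str


@dataclass
class Sample:
    """An image plus one or more turns; metadata fields are illustrative."""
    image_path: str
    domain: str
    egocentric: bool
    turns: List[Turn] = field(default_factory=list)

    @property
    def is_multi_turn(self) -> bool:
        return len(self.turns) > 1


def split_by_task(samples):
    """Partition samples into single-turn and multi-turn subsets."""
    single = [s for s in samples if not s.is_multi_turn]
    multi = [s for s in samples if s.is_multi_turn]
    return single, multi


# Made-up samples for demonstration only.
demo = [
    Sample("img_001.jpg", "food", True, [Turn("What dish is this?", "Ramen")]),
    Sample("img_002.jpg", "vehicles", True, [
        Turn("What car is this?", "A sedan"),
        Turn("Who makes it?", "I don't know"),
    ]),
]
single, multi = split_by_task(demo)
print(len(single), len(multi))  # 1 1
```

Keeping the turns in one list lets the same record type serve both the single-turn and the multi-turn tasks, at the cost of not encoding task-specific retrieval sources.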

Business Value

Facilitates the development and evaluation of more capable AI assistants for wearable devices, leading to improved user interaction and access to information in real-time, context-aware scenarios.

Paper Metadata

Innovation Type

Benchmark and Dataset

Deployment Feasibility

The benchmark itself is a research artifact. The MM-RAG systems it aims to evaluate are complex and require significant computational resources.

Limitations Addressed

Lack of comprehensive benchmarks for MM-RAG, especially for wearables; Scarcity of diverse datasets reflecting real-world scenarios; Challenges in multi-turn visual dialogue

Technical Tags

Multi-modal RAG, Benchmark, Wearable devices, Egocentric images, Multi-turn conversation, Visual grounding, Information retrieval, Generative AI

Research Topics

Multimodal AI, Retrieval-Augmented Generation (RAG), Natural Language Processing, Computer Vision, Human-Computer Interaction, Benchmark Creation

Methods & Architectures

Benchmark dataset creation, Multi-modal RAG evaluation, Multi-turn conversation modeling

Applications & Tasks

Wearable computing, Smart glasses, Personal assistants, Context-aware AI, Evaluating multi-modal RAG systems, Handling complex visual questions, Maintaining context in multi-turn conversations, Addressing image quality issues, Multi-modal question answering, Multi-turn dialogue generation, Visual grounding, Information retrieval from images and text

Datasets & Benchmarks

Datasets

CRAG-MM

Related Fields

Artificial Intelligence, Natural Language Processing, Computer Vision, Human-Computer Interaction, Ubiquitous Computing

Keywords

MM-RAG, Benchmark, Multimodal, Wearable Devices, Smart Glasses, Egocentric Vision, Multi-turn Conversation, Visual Question Answering, Retrieval-Augmented Generation, Dataset, AI Assistant

Academic Context

#Multimodal AI, #Retrieval-Augmented Generation (RAG), #Natural Language Processing, #Computer Vision, #Human-Computer Interaction, #Benchmark Creation

Commercial Potential

Potential Products

Advanced AI assistants for smart glasses, Context-aware information retrieval systems

Target Industries

Consumer Electronics, Technology, Wearable Technology, Software Development

Use Case Examples

Answering questions about objects seen through smart glasses, Providing contextual information during a conversation, Guiding users through tasks based on visual input

Competitive Edge

Establishes a new standard and dataset for evaluating MM-RAG systems in a challenging, real-world context (wearables), driving progress in the field.

Resource Requirements

Compute Needs

High (for training and evaluating MM-RAG models)

Data Requirements

The CRAG-MM dataset itself.

Deployment Constraints

Computational cost, latency, power consumption on wearable devices.

Scalability

The benchmark can be used to evaluate scalable MM-RAG models.

Production Readiness

Maturity Level

Research

Time to Market

Long