Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
enhancing large language models (LLMs) by retrieving relevant documents from an
external corpus. However, existing RAG systems primarily focus on unimodal text
documents and often fall short in real-world scenarios where both queries and
documents may contain mixed modalities (such as text and images). In this
paper, we address the challenge of Universal Retrieval-Augmented Generation
(URAG), which involves retrieving and reasoning over mixed-modal information to
improve vision-language generation. To this end, we propose Nyx, a unified
mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate
the scarcity of realistic mixed-modal data, we introduce a four-stage automated
pipeline for generation and filtering, leveraging web documents to construct
NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that
better reflect real-world information needs. Building on this high-quality
dataset, we adopt a two-stage training framework for Nyx: we first pre-train on
NyxQA together with a variety of open-source retrieval datasets, and then apply
supervised fine-tuning using feedback from downstream
vision-language models (VLMs) to align retrieval outputs with generative
preferences. Experimental results demonstrate that Nyx not only performs
competitively on standard text-only RAG benchmarks, but also excels in the more
general and realistic URAG setting, significantly improving generation quality
in vision-language tasks.
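The abstract does not include an implementation, but the core idea of a mixed-modal to mixed-modal retriever can be illustrated with a small sketch. Everything below is assumed for illustration: `encode_mixed` is a hypothetical stand-in for whatever joint encoder Nyx actually uses, and the embedding size, corpus, and scoring are placeholders, not the paper's method.

```python
import numpy as np

# Illustrative stub only: Nyx's real encoder is not described in the
# abstract, so encode_mixed() is a hypothetical stand-in for any model
# that maps a (text, images) pair into one joint embedding space.
def encode_mixed(text: str, image_paths: list[str]) -> np.ndarray:
    """Return a unit-normalized joint embedding for a mixed-modal input."""
    seed = abs(hash((text, tuple(image_paths)))) % (2**32)
    rng = np.random.default_rng(seed)      # fake features, stable within a run
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def retrieve(query_text: str, query_images: list[str],
             corpus: list[dict], top_k: int = 3):
    """Rank mixed-modal documents by cosine similarity to a mixed-modal query."""
    q = encode_mixed(query_text, query_images)
    doc_vecs = np.stack([encode_mixed(d["text"], d["images"]) for d in corpus])
    scores = doc_vecs @ q                  # cosine: all vectors are unit-norm
    order = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in order]

# Both queries and documents may mix text and images -- the URAG setting.
corpus = [
    {"text": "Diagram of a transformer encoder block", "images": ["enc.png"]},
    {"text": "Night photo of the Eiffel Tower", "images": ["tower.jpg"]},
    {"text": "Plain-text survey of dense retrieval", "images": []},
]
for doc, score in retrieve("How does a transformer encode its input?", [], corpus):
    print(f"{score:.3f}  {doc['text']}")
```

With a real encoder in place of the stub, the same interface covers text-only queries, image-only queries, and mixtures of both, which is what distinguishes the URAG setting from standard text-to-text retrieval.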
Authors (4)
Chenghao Zhang
Guanting Dong
Xinyu Yang
Zhicheng Dou
Submitted
October 20, 2025
Key Contributions
This paper introduces Nyx, a unified mixed-modal to mixed-modal retriever for Universal Retrieval-Augmented Generation (URAG) scenarios. It addresses the scarcity of mixed-modal data by proposing an automated pipeline to construct the NyxQA dataset, which better reflects real-world information needs for vision-language generation.
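The abstract states that the pipeline has four stages for generation and filtering but does not name them, so the decomposition below (harvest, generate, filter, validate) is a guess for illustration only; the stage names and quality checks are hypothetical placeholders, not NyxQA's actual construction.

```python
# Hypothetical four-stage skeleton for building mixed-modal QA data from
# web documents. The real NyxQA stages are not specified in this abstract;
# every stage name and check here is an assumed placeholder.
def harvest(urls):
    """Stage 1 (assumed): collect web documents with text and image refs."""
    return [{"text": f"content of {u}", "images": [f"{u}/img0.png"]} for u in urls]

def generate_qa(doc):
    """Stage 2 (assumed): have a generator model draft a QA pair per document."""
    return {"question": f"What does the figure in {doc['images'][0]} show?",
            "answer": doc["text"][:80], "doc": doc}

def filter_quality(pair):
    """Stage 3 (assumed): keep pairs passing simple well-formedness checks."""
    return bool(pair["question"].strip()) and bool(pair["answer"].strip())

def validate(pair):
    """Stage 4 (assumed): verify the answer is grounded in the source document."""
    return pair["answer"] in pair["doc"]["text"]

def build_dataset(urls):
    pairs = (generate_qa(d) for d in harvest(urls))
    return [p for p in pairs if filter_quality(p) and validate(p)]

print(len(build_dataset(["example.org/a", "example.org/b"])))
```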
Business Value
Enables more sophisticated AI assistants and search engines that can understand and generate content from diverse data types (text, images), leading to richer user experiences and more accurate information retrieval.