Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in
generating photorealistic and prompt-aligned images, but they often produce
outputs that contradict verifiable knowledge, especially when prompts involve
fine-grained attributes or time-sensitive events. Conventional
retrieval-augmented approaches attempt to address this issue by introducing
external information, yet they are fundamentally incapable of grounding
generation in accurate and evolving knowledge due to their reliance on static
sources and shallow evidence integration. To bridge this gap, we introduce
ORIG, an agentic open multimodal retrieval-augmented framework for Factual
Image Generation (FIG), a new task that requires both visual realism and
factual grounding. ORIG iteratively retrieves and filters multimodal evidence
from the web and incrementally integrates the refined knowledge into enriched
prompts to guide generation. To support systematic evaluation, we build
FIG-Eval, a benchmark spanning ten categories across perceptual, compositional,
and temporal dimensions. Experiments demonstrate that ORIG substantially
improves factual consistency and overall image quality over strong baselines,
highlighting the potential of open multimodal retrieval for factual image
generation.
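The abstract describes ORIG's agentic loop only in prose. The sketch below is a minimal, hypothetical illustration of such a retrieve-filter-integrate cycle, not the paper's implementation; the helper names (search_web, filter_evidence, enrich_prompt, is_sufficient, generate_image) are assumptions introduced here for clarity.

```python
# Hypothetical sketch of an agentic retrieve-filter-integrate loop for
# Factual Image Generation. All helpers are placeholders, not ORIG's API.

def factual_image_generation(prompt: str, max_rounds: int = 3):
    evidence = []                # accumulated multimodal evidence (text + images)
    enriched_prompt = prompt     # starts from the user's original prompt

    for _ in range(max_rounds):
        # 1. Retrieve candidate multimodal evidence from the open web.
        candidates = search_web(enriched_prompt)

        # 2. Filter out irrelevant or unreliable items before integration.
        evidence.extend(filter_evidence(candidates, enriched_prompt))

        # 3. Incrementally fold the refined knowledge into an enriched prompt.
        enriched_prompt = enrich_prompt(prompt, evidence)

        # 4. Stop early once the agent judges the gathered evidence sufficient.
        if is_sufficient(enriched_prompt, evidence):
            break

    # Guide the image generator with the knowledge-enriched prompt.
    return generate_image(enriched_prompt, evidence)
```

The point of the structure is that retrieval and prompt enrichment are interleaved across rounds, so time-sensitive or fine-grained facts gathered in one round can steer what is searched for in the next, rather than being injected once from a static source.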
Authors (6)
Yang Tian
Fan Liu
Jingyuan Zhang
Wei Bi
Yupeng Hu
Liqiang Nie
Submitted
October 26, 2025
Key Contributions
Introduces ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG). ORIG addresses the factual inconsistency of LMMs by iteratively retrieving, filtering, and integrating multimodal evidence from the web into enriched prompts, enabling generation of images that are both visually realistic and factually grounded.
Business Value
Enables the creation of highly reliable and accurate visual content for applications requiring factual precision, such as news reporting, educational materials, and scientific visualization. This reduces the risk of misinformation and enhances trust in AI-generated visuals.