Open Multimodal Retrieval-Augmented Factual Image Generation

Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie

2025-10-28

Summary

This paper introduces a new approach to creating images from text descriptions, focusing on making sure the images are not only realistic but also factually correct.

What's the problem?

Current image generation models are good at producing attractive pictures that match a prompt, but they often get the details wrong, especially when the prompt involves specific facts or things that change over time. Simply bolting on information from outside sources doesn't fully fix this, because that information is often static and outdated, and it tends to be integrated only shallowly during image creation.

What's the solution?

The researchers developed a system called ORIG that acts as an agent searching the open web for relevant information. Rather than retrieving once, it repeatedly searches, filters, and refines the evidence it finds, then incrementally folds that refined knowledge into an enriched prompt that guides image generation, making the final image more accurate and consistent with real-world facts. They also built a new benchmark, FIG-Eval, that measures factual correctness alongside image quality across perceptual, compositional, and temporal categories.
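The retrieve-filter-enrich loop can be sketched in a few lines. This is a hedged illustration only: the paper's actual agent, its web/multimodal search tools, and its filtering criteria are not specified here, so `search`, `filter_evidence`, and `enrich_prompt` below are hypothetical stand-ins operating on a toy in-memory corpus.

```python
def search(query):
    # Stand-in for open web retrieval: returns candidate text snippets.
    # ORIG itself retrieves multimodal (text + image) evidence from the web.
    corpus = {
        "eiffel tower height": [
            "The Eiffel Tower is about 330 m tall including antennas.",
            "The Eiffel Tower is located in Berlin.",  # deliberately wrong hit
        ],
    }
    return corpus.get(query, [])

def filter_evidence(snippets):
    # Stand-in relevance/fact filter: drop snippets that fail a simple check.
    # A real system would verify snippets against other sources or a model.
    return [s for s in snippets if "Berlin" not in s]

def enrich_prompt(prompt, query, rounds=2):
    """Iteratively retrieve and filter evidence, folding it into the prompt."""
    evidence = []
    for _ in range(rounds):
        for hit in filter_evidence(search(query)):
            if hit not in evidence:  # keep only new evidence each round
                evidence.append(hit)
    if evidence:
        prompt = prompt + " Facts: " + " ".join(evidence)
    return prompt

enriched = enrich_prompt("A photo of the Eiffel Tower at sunset",
                         "eiffel tower height")
print(enriched)
```

The key design point the sketch captures is that retrieval is a loop, not a one-shot lookup: each round can add evidence the previous round surfaced, and filtering happens before anything reaches the generation prompt.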

Why it matters?

This work matters because it pushes image generation beyond merely looking good toward being factually truthful. That shift is crucial wherever accuracy counts, such as educational materials, news illustration, or any setting where a misleading image could do harm. It also shows that actively searching for and integrating knowledge during the generation process, rather than relying on a model's frozen training data, can significantly improve factual consistency.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.