World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi

2025-10-14

Summary

This paper introduces a new system called World-To-Image that helps AI create images from text descriptions, even when the descriptions include things the AI hasn't 'seen' before.

What's the problem?

Current AI image generators are very capable, but they struggle when asked to create images of things they weren't trained on, like a brand-new invention or a very specific, recently popular object. Their knowledge is limited to what they were shown during training, a limitation known as a knowledge cutoff: they simply don't 'know' about anything outside their training data.

What's the solution?

The researchers created a system in which an 'agent' automatically searches the internet for images related to the unfamiliar things mentioned in the text description. It then uses these retrieved images to refine the text prompt, guiding the AI image generator toward a more accurate and visually appealing picture. The process typically converges in fewer than three iterations, making it efficient.
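The retrieve-then-optimize loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual implementation: every function name here (`retrieve_reference_images`, `optimize_prompt`, and so on) is a hypothetical stand-in for a component the paper describes.

```python
# Hypothetical sketch of the World-To-Image agent loop.
# All functions are illustrative stubs, not the authors' API.

def retrieve_reference_images(concept):
    # Stand-in for the agent's web search, which would return
    # actual images for a concept unknown to the base model.
    return [f"ref_image_for_{concept}.jpg"]

def optimize_prompt(prompt, references):
    # Stand-in for multimodal prompt optimization: fold the
    # retrieved visual context into the text prompt.
    return prompt + " | visual refs: " + ", ".join(references)

def generate_image(prompt):
    # Stand-in for a call to the T2I backbone.
    return {"prompt": prompt}

def score_alignment(image, target_prompt):
    # Stand-in for evaluators such as ImageReward or an LLM grader.
    return 1.0 if "visual refs" in image["prompt"] else 0.0

def world_to_image(prompt, unknown_concepts, max_iters=3, threshold=0.9):
    """Iteratively refine the prompt with retrieved references,
    stopping early once the generated image scores well enough."""
    refs = [r for c in unknown_concepts
            for r in retrieve_reference_images(c)]
    best_image, best_score = None, -1.0
    for _ in range(max_iters):
        prompt = optimize_prompt(prompt, refs)
        image = generate_image(prompt)
        score = score_alignment(image, prompt)
        if score > best_score:
            best_image, best_score = image, score
        if best_score >= threshold:
            break  # few iterations suffice, as the paper reports
    return best_image, best_score
```

The key design idea captured here is that the generator itself is never retrained; only the prompt is updated between iterations, which is what keeps the loop cheap.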

Why it matters?

This work is important because it makes AI image generators more adaptable to a real world that is constantly changing. They can depict new concepts and objects without needing to be constantly retrained, so the AI stays current and generates images that are more relevant and accurate. In addition, the modern evaluation methods the authors use provide a better way to judge how well the AI actually understands and follows the instructions.

Abstract

While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.