VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi

2025-11-26

Summary

This paper focuses on a new challenge in artificial intelligence called Visual Question-Visual Answering, or VQ-VA, where the goal is for a computer to *create* an image as the answer to a question about a picture, instead of writing out a text answer. This ability has recently started to appear in advanced proprietary AI systems such as NanoBanana and GPT-Image.

What's the problem?

Currently, the ability to generate images as answers to visual questions is mostly limited to AI systems that aren’t publicly available. This means researchers and developers can’t easily study or build upon this technology. The problem is a lack of open-source data and models capable of performing VQ-VA well.

What's the solution?

To address this, the researchers created a system called VQ-VA World, an agentic pipeline that automatically crawls the web and assembles a dataset of roughly 1.8 million high-quality, interleaved image-text samples for training. They also built a human-curated benchmark, called IntelligentBench, which specifically checks how well an AI draws on world knowledge, design knowledge, and reasoning when creating answer images. They then used this data to fine-tune an existing open-source model, LightFusion, raising its IntelligentBench score from 7.78 to 53.06.
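The paper does not spell out the pipeline's internals here, but the general pattern of crawling candidates and keeping only high-quality ones can be sketched as follows. Everything in this snippet (the `Sample` fields, `quality_score` heuristics, and the 0.8 threshold) is an illustrative assumption, not the authors' actual implementation, which uses an agentic, model-driven pipeline at web scale.

```python
# Hypothetical sketch of a data-curation loop for VQ-VA training data:
# collect candidate interleaved samples, score them with simple quality
# checks, and keep only the high-scoring ones. All names and thresholds
# are illustrative, not the paper's actual pipeline.

from dataclasses import dataclass


@dataclass
class Sample:
    question: str        # the visual question text
    source_image: str    # URL/path of the input image
    answer_image: str    # URL/path of the image that answers the question
    caption: str         # text accompanying the answer image


def quality_score(s: Sample) -> float:
    """Toy quality heuristic: reward non-trivial questions and captions,
    penalize missing or duplicated images. A real agentic pipeline would
    use model-based judges rather than string-length checks."""
    score = 0.0
    if len(s.question.split()) >= 4:          # question is non-trivial
        score += 0.5
    if s.source_image and s.answer_image and s.source_image != s.answer_image:
        score += 0.3                          # distinct input/answer images
    if len(s.caption) >= 10:                  # caption carries some content
        score += 0.2
    return score


def curate(candidates, threshold=0.8):
    """Keep only samples whose quality score clears the threshold."""
    return [s for s in candidates if quality_score(s) >= threshold]


if __name__ == "__main__":
    pool = [
        Sample("How would this room look after renovation?",
               "img/room.jpg", "img/room_after.jpg",
               "The renovated room with modern furniture."),
        Sample("What?", "img/a.jpg", "img/a.jpg", "dup"),
    ]
    print(len(curate(pool)))  # only the well-formed sample survives
```

The key design idea this sketch tries to convey is that curation is a filter over a much larger crawled pool: the pipeline can afford to discard most candidates because the web supplies candidates at scale.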

Why it matters?

This work is important because it opens up the field of VQ-VA to the broader AI community. By releasing the data, the benchmark, and the improved model, the authors let other researchers build on this work and develop more capable systems that answer visual questions with images, narrowing the gap between closed-source and open-source capabilities.

Abstract

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.