
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li

2025-02-21


Summary

This paper introduces LongWriter-V, a system that helps AI models write very long, detailed text based on images and instructions. It improves how well AI handles tasks that require generating thousands of words while staying accurate and grounded in the input images.

What's the problem?

Current vision-language models struggle to write coherent and meaningful text when the output needs to be longer than 1,000 words. This happens because these models aren't trained with enough examples of long outputs, which limits their ability to handle complex, detailed tasks.

What's the solution?

The researchers created a large dataset called LongWriter-V-22k, which contains 22,158 examples of image-based tasks with outputs of up to 10,000 words. They also developed a training method called IterDPO, which breaks long outputs into segments and uses iterative corrections to build preference pairs, making feedback on very long texts affordable to collect (sketched below). They tested their model on a new benchmark called MMLongBench-Write and showed that it outperformed larger, more expensive models like GPT-4o at generating high-quality long text.

Why it matters?

This matters because it enables AI to handle more complex writing tasks, like creating detailed reports or stories, which could be useful in fields like education, journalism, or professional documentation. By improving how AI generates long outputs while staying accurate, this research helps make AI more practical and reliable for real-world applications.

Abstract

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V