LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Zirun Guo, Feng Zhang, Kai Jia, Tao Jin

2025-09-18

Summary

This paper introduces a new system called LLM-Interleaved, or LLM-I, a smarter way to create documents that mix text and images. Instead of relying on a single AI model to do everything, it lets a main AI 'brain' choose from a variety of specialized 'tools' to produce each image.

What's the problem?

Current unified AI models that generate images and text together often struggle. They are good at making pretty pictures, but they have trouble when an image needs to be factually correct or requires precise details, like following specific instructions or drawing on real-world knowledge. They are limited because they try to do everything with a single built-in image generator.

What's the solution?

LLM-I works by having a large language model (LLM) act as a manager. The manager decides when to call different tools: searching for images online, creating images from scratch with diffusion models, running code to render parts of a figure, or editing existing images. The system learns which tool to use, and when, through reinforcement learning, with feedback from both rule-based checks and other AI models that judge the quality of the results. The authors trained it on a diverse new dataset and tested it with four different base models.
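To make the "manager calling tools" idea concrete, here is a minimal sketch of a tool-dispatch loop in this spirit. All names, the tag format, and the stub backends are assumptions for illustration, not the paper's actual API: the agent emits tagged tool calls inside its text, and a dispatcher routes each call to one of the four tool families the paper describes.

```python
import re

# Stub backends standing in for the four tool families (illustrative only).
def image_search(query):
    return f"<searched image for '{query}'>"

def diffusion_generate(prompt):
    return f"<generated image of '{prompt}'>"

def run_code(snippet):
    return f"<figure rendered by code: {snippet}>"

def edit_image(instruction):
    return f"<edited image: {instruction}>"

TOOLS = {
    "search": image_search,
    "generate": diffusion_generate,
    "code": run_code,
    "edit": edit_image,
}

# Hypothetical inline call syntax: <tool name="search">query</tool>
CALL_PATTERN = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.S)

def render(agent_output):
    """Replace each inline tool call with whatever the chosen tool returns."""
    def dispatch(match):
        name, arg = match.group(1), match.group(2).strip()
        return TOOLS[name](arg)
    return CALL_PATTERN.sub(dispatch, agent_output)

draft = ('Here is a real map: <tool name="search">2024 solar eclipse path</tool> '
         'and an artist\'s impression: <tool name="generate">eclipse over mountains</tool>')
print(render(draft))
```

The key design point this illustrates is that text generation and image production are decoupled: the LLM only decides *which* tool fits each slot, so factual slots can go to search and precise charts to code, rather than forcing one generator to handle everything.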

Why it matters?

This research is important because it shows a more effective way to combine the strengths of different AI tools. By letting one AI 'orchestrate' other specialized AIs, the system can produce more accurate, detailed, and realistic mixed text-and-image content, going beyond what current all-in-one models can achieve. This could lead to better multimodal generation for a wide range of applications.

Abstract

We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
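The abstract's "hybrid reward system" can be sketched as a weighted mix of deterministic rule checks and scores from model-based judges. The specific rule (counting image slots), the weights, and the function names below are assumptions for illustration; the paper's actual reward design may differ.

```python
def rule_reward(response, expected_num_images):
    """Rule-based check (assumed example): did the response include the
    expected number of image slots?"""
    return 1.0 if response.count("<image>") == expected_num_images else 0.0

def hybrid_reward(response, expected_num_images,
                  llm_judge_score, mllm_judge_score,
                  w_rule=0.4, w_llm=0.3, w_mllm=0.3):
    """Weighted combination of rule-based logic with LLM and MLLM judge
    scores, each assumed to lie in [0, 1]. Weights are illustrative."""
    return (w_rule * rule_reward(response, expected_num_images)
            + w_llm * llm_judge_score
            + w_mllm * mllm_judge_score)

# Example: correct structure, strong judge scores.
r = hybrid_reward("Intro text <image> caption <image>", 2,
                  llm_judge_score=0.8, mllm_judge_score=0.9)
print(round(r, 2))  # 0.4*1.0 + 0.3*0.8 + 0.3*0.9 = 0.91
```

Mixing the two signal types hedges against each one's weakness: rule checks are cheap and un-gameable but coarse, while model judges capture quality and relevance but can be noisy.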