One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
2025-01-24

Summary
This paper introduces a new method called One-Prompt-One-Story (1Prompt1Story) that helps AI create a series of images with consistent characters and scenes from just one long text description, instead of needing multiple separate prompts or extra training.
What's the problem?
Current AI models that turn text into images are great at making single pictures, but they have trouble keeping characters and objects consistent when creating multiple images for a story. The existing ways to fix this usually need a lot of extra training or changes to the AI model, which can be time-consuming and limit how widely these methods can be used.
What's the solution?
The researchers came up with 1Prompt1Story, which takes advantage of how language naturally keeps track of who's who across a story. Instead of using a separate prompt for each image, they combine all the descriptions into one long prompt. They then add two new techniques, Singular-Value Reweighting and Identity-Preserving Cross-Attention, which help the AI focus on the right part of the description for each image while keeping the characters consistent throughout the story.
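To make that a bit more concrete, here is a minimal sketch of the singular-value reweighting idea in Python. It is a toy illustration under assumptions: NumPy with random vectors stands in for a real text encoder (e.g., CLIP), the word-level segmentation and the scaling factors are invented for illustration, and none of the function names below come from the authors' released code (see https://github.com/byliutao/1Prompt1Story for the actual implementation).

```python
# Toy sketch: reweight the singular values of per-frame blocks of a single
# concatenated prompt's token embeddings. Random embeddings stand in for a
# real text encoder; scaling factors are illustrative assumptions.
import numpy as np

def reweight_singular_values(token_embeds, scale):
    """Rescale the singular values of one block of token embeddings.

    Scaling up (scale > 1) emphasizes the semantics carried by this block;
    scaling down (scale < 1) suppresses them while keeping the tokens in the
    prompt, so the shared identity context is preserved.
    """
    U, S, Vh = np.linalg.svd(token_embeds, full_matrices=False)
    return U @ np.diag(S * scale) @ Vh

# One long prompt: a shared identity description followed by per-frame scenes.
identity = "a watercolor fox with a red scarf"
frames = ["walking through a snowy forest",
          "reading a book by candlelight",
          "sailing a paper boat on a pond"]
single_prompt = identity + ", " + ", ".join(frames)

# Stand-in "text encoder": one random 768-d vector per whitespace token.
tokens = single_prompt.split()
rng = np.random.default_rng(0)
embeds = rng.standard_normal((len(tokens), 768))

# Token ranges for each frame's description within the single prompt
# (word counts here; a real pipeline would use tokenizer offsets).
offsets, start = [], len(identity.split())
for f in frames:
    n = len(f.split())
    offsets.append((start, start + n))
    start += n

def embeddings_for_frame(embeds, offsets, active, boost=1.5, suppress=0.3):
    """Build frame-specific embeddings from the shared single-prompt embeddings."""
    out = embeds.copy()
    for i, (lo, hi) in enumerate(offsets):
        scale = boost if i == active else suppress
        out[lo:hi] = reweight_singular_values(out[lo:hi], scale)
    return out

frame0_embeds = embeddings_for_frame(embeds, offsets, active=0)
print(frame0_embeds.shape)  # (num_tokens, 768)
```

In the actual method, such frame-specific embeddings would presumably serve as the cross-attention conditioning for each generated frame, combined with the identity-preserving cross-attention step described in the abstract below.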
Why it matters?
This matters because it makes it much easier and faster to create a series of images that tell a consistent story. It could be really useful for things like making storyboards for movies, creating illustrations for books, or even helping with animation. The best part is that it doesn't need any extra training, so it can be used right away with different types of AI image generators. This could make storytelling with AI-generated images more accessible and creative for everyone from professional artists to students working on school projects.
Abstract
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to generate identity-consistent images as required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architecture, which limits their applicability across domains and diverse diffusion model configurations. In this paper, we first observe an inherent capability of language models, coined context consistency, to comprehend identity through context within a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach, 1Prompt1Story, concatenates all prompts into a single input for T2I diffusion models, which initially preserves character identities. We then refine the generation process with two novel techniques, Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches and demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.