SEED-Story: Multimodal Long Story Generation with Large Language Model
Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen
2024-07-13

Summary
This paper introduces SEED-Story, a new method that uses a Multimodal Large Language Model (MLLM) to create long stories that interleave text and images. The goal is to generate engaging narratives with visuals that remain consistent in style and character appearance.
What's the problem?
Creating stories that effectively combine text and images is challenging because it requires understanding how these two forms of information interact. Many existing methods struggle to produce coherent and contextually relevant stories over long sequences, which can lead to disjointed narratives or mismatched visuals.
What's the solution?
SEED-Story addresses these challenges with a powerful MLLM that predicts both text tokens and visual tokens; the visual tokens are then decoded into images by an adapted visual de-tokenizer so that characters and style stay consistent across the story. It also introduces a mechanism called the multimodal attention sink, which lets the model efficiently generate stories of up to 25 multimodal sequences even though it was trained on only 10 (a sketch of the underlying idea follows below). Additionally, the researchers created a large-scale dataset called StoryStream to train the model and to evaluate multimodal story generation.
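The paper describes the multimodal attention sink only at a high level here, so the sketch below illustrates the general attention-sink idea it builds on: keep the key/value cache of a few initial "sink" tokens plus a sliding window of recent tokens, so autoregressive generation can extend well past the training length with a bounded cache. The function name, tensor layout, and parameter values are assumptions for illustration, not the paper's implementation.

```python
import torch


def prune_kv_cache(keys, values, sink_len=4, window_len=2048):
    """Minimal attention-sink style KV-cache pruning step (illustrative only;
    names and defaults are assumptions, not the SEED-Story code).

    keys, values: tensors of shape [batch, heads, seq_len, head_dim]
    sink_len:     number of earliest tokens that are always kept ("sinks")
    window_len:   number of most recent tokens kept in a sliding window
    """
    seq_len = keys.size(2)
    if seq_len <= sink_len + window_len:
        return keys, values  # nothing to evict yet

    # Keep the initial sink tokens plus the most recent window; tokens in
    # between are evicted so the cache stays bounded even when the generated
    # story grows far beyond the sequence length seen during training.
    kept_k = torch.cat([keys[:, :, :sink_len], keys[:, :, -window_len:]], dim=2)
    kept_v = torch.cat([values[:, :, :sink_len], values[:, :, -window_len:]], dim=2)
    return kept_k, kept_v
```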
Why it matters?
This research is important because it enhances the ability of AI to create rich, multimodal narratives that can be used in various applications, such as storytelling, education, and entertainment. By improving how machines generate stories with both text and images, SEED-Story can lead to more engaging and immersive experiences for users.
Abstract
With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of the MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 are used for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
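As a rough illustration of the interleaved generation loop described in the abstract, the sketch below shows one way such a pipeline could be wired together: the MLLM autoregressively emits a narrative text segment and a span of visual tokens each turn, and a visual de-tokenizer renders the visual tokens into an image. The StoryMLLM and VisualDetokenizer interfaces, method names, and the 25-turn default are hypothetical stand-ins, not the released SEED-Story API.

```python
from typing import List, Protocol, Tuple


class StoryMLLM(Protocol):
    """Assumed interface for the multimodal LLM (illustrative only)."""

    def next_turn(self, history: List[Tuple[str, List[int]]]) -> Tuple[str, List[int]]:
        """Return the next narrative text and its predicted visual tokens."""
        ...


class VisualDetokenizer(Protocol):
    """Assumed interface for the adapted visual de-tokenizer."""

    def decode(self, visual_tokens: List[int]) -> bytes:
        """Render a span of visual tokens into image bytes."""
        ...


def generate_story(mllm: StoryMLLM,
                   detok: VisualDetokenizer,
                   opening: Tuple[str, List[int]],
                   num_turns: int = 25) -> List[Tuple[str, bytes]]:
    """Autoregressively extend a story: each turn the MLLM predicts text
    tokens plus visual tokens, and the de-tokenizer turns the visual tokens
    into an image intended to stay consistent with earlier panels."""
    history = [opening]
    story = []
    for _ in range(num_turns):
        text, visual_tokens = mllm.next_turn(history)
        image = detok.decode(visual_tokens)
        story.append((text, image))
        history.append((text, visual_tokens))  # feed the turn back as context
    return story
```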