MovieSum: An Abstractive Summarization Dataset for Movie Screenplays
Rohit Saxena, Frank Keller
2024-08-14

Summary
This paper presents MovieSum, a new dataset for abstractive summarization of movie screenplays, intended to support research on how language models handle and summarize very long texts.
What's the problem?
Summarizing movie screenplays is challenging because they are long and contain structural elements specific to films. Large language models have made progress on summarizing other kinds of documents, but they often struggle with long inputs such as full movie scripts. The area has also received less attention than related ones, such as summarizing television transcripts.
What's the solution?
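To tackle this problem, the authors created MovieSum, which pairs 2,200 movie screenplays with their corresponding Wikipedia plot summaries. They manually formatted the screenplays to represent their structural elements, and the resulting dataset is twice the size of previous movie screenplay datasets. Each entry also carries metadata with an IMDb ID, which makes it easier to link to external information about the film. The dataset gives researchers a benchmark for training and evaluating models that summarize movie scripts, and the authors report baseline results from recently released large language models.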
Why does it matter?
This research is important because it fills a gap in the study of AI summarization for movies specifically. A dataset like MovieSum enables further research and development in this area, which could lead to better AI tools for filmmakers, scriptwriters, and viewers who want quick summaries of films.
Abstract
Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: (1) It includes movie screenplays, which are longer than scripts of TV episodes. (2) It is twice the size of previous movie screenplay datasets. (3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline.
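To make the shape of the data concrete, the following is a minimal sketch of how one screenplay-summary pair might be held in memory, along with a simple length ratio that shows why the long-input problem arises. The class name, field names, sample text, and placeholder IMDb ID are all illustrative assumptions; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass
class ScreenplayRecord:
    """Hypothetical record layout; field names are illustrative only."""
    imdb_id: str      # IMDb identifier (placeholder value used below)
    screenplay: str   # full formatted screenplay text
    summary: str      # Wikipedia plot summary


def compression_ratio(record: ScreenplayRecord) -> float:
    """Whitespace-token count of the screenplay divided by that of the summary."""
    summary_len = len(record.summary.split())
    if summary_len == 0:
        raise ValueError("summary is empty")
    return len(record.screenplay.split()) / summary_len


# Toy example with invented text, not taken from the dataset.
record = ScreenplayRecord(
    imdb_id="tt0000000",
    screenplay="INT. KITCHEN - DAY\nANNA pours coffee and stares out the window.",
    summary="Anna drinks coffee.",
)
print(compression_ratio(record))
```

For real screenplays this ratio would be far larger than in the toy example, which is the sense in which the task requires understanding long input contexts.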