Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names
Ragav Sachdeva, Gyungin Shin, Andrew Zisserman
2024-08-02

Summary
This paper introduces Magiv2, a model that automatically generates dialogue transcripts for entire manga chapters, making them accessible to visually impaired readers. It focuses on accurately detecting dialogue and attributing each line to the correct character.
What's the problem?
Manga is primarily a visual medium, which makes it difficult for visually impaired individuals to enjoy. Existing transcription methods often struggle to detect text accurately and to identify which character is speaking, producing confusing and inconsistent transcripts.
What's the solution?
To tackle these challenges, the authors developed Magiv2, a model that generates high-quality, chapter-wide transcripts by detecting the text on each page and classifying it as essential (dialogue and narration) or non-essential. It then attributes each dialogue to a specific character, ensuring that characters are named consistently throughout the chapter. To supply those names, Magiv2 draws on a character bank of over 11,000 characters from 76 manga series. The authors also introduce new annotations for speech-bubble tails, the pointers that indicate who is speaking, which sharpen speaker attribution and improve the overall quality of the transcripts (see the sketch below).
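To make the pipeline concrete, here is a minimal Python sketch of the steps described above: filter out non-essential text, link each remaining text box to a speaker via its speech-bubble tail, and name the speaker by matching a character embedding against the character bank. Every name and data structure here is a hypothetical simplification for illustration, not the actual Magiv2 architecture or API, which performs these steps jointly with learned components.

```python
# Hypothetical sketch of chapter-wide manga transcription.
# All classes and functions are illustrative placeholders, not the Magiv2 API.
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class TextBox:
    text: str                      # OCR'd contents of the text box
    is_essential: bool             # dialogue/narration vs. sound effects etc.
    speaker_embedding: np.ndarray  # embedding of the character box linked via the bubble tail


def name_speaker(embedding: np.ndarray, bank: dict[str, np.ndarray]) -> str:
    """Name a speaker by nearest cosine similarity in the character bank."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(bank, key=lambda name: cosine(embedding, bank[name]))


def transcribe_chapter(pages: list[list[TextBox]], bank: dict[str, np.ndarray]) -> list[str]:
    """Turn detected per-page text boxes into a named, chapter-wide transcript."""
    transcript: list[str] = []
    for page in pages:
        for box in page:
            if not box.is_essential:  # drop non-essential text
                continue
            speaker = name_speaker(box.speaker_embedding, bank)
            transcript.append(f"{speaker}: {box.text}")
    return transcript
```

Because the bank is fixed for the whole chapter, every appearance of the same character resolves to the same name, which is what gives the transcript its narrative consistency.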
Why it matters?
This research is important because it significantly improves accessibility for visually impaired manga readers, allowing them to engage with stories that were previously difficult to access. By providing accurate and consistent transcripts, Magiv2 not only helps individuals enjoy manga but also sets a precedent for making other visual media more accessible in the future.
Abstract
Enabling engagement of manga by visually impaired individuals presents a significant challenge due to its inherently visual nature. With the goal of fostering accessibility, this paper aims to generate a dialogue transcript of a complete manga chapter, entirely automatically, with a particular emphasis on ensuring narrative consistency. This entails identifying (i) what is being said, i.e., detecting the texts on each page and classifying them into essential vs non-essential, and (ii) who is saying it, i.e., attributing each dialogue to its speaker, while ensuring the same characters are named consistently throughout the chapter. To this end, we introduce: (i) Magiv2, a model that is capable of generating high-quality chapter-wide manga transcripts with named characters and significantly higher precision in speaker diarisation over prior works; (ii) an extension of the PopManga evaluation dataset, which now includes annotations for speech-bubble tail boxes, associations of text to corresponding tails, classifications of text as essential or non-essential, and the identity for each character box; and (iii) a new character bank dataset, which comprises over 11K characters from 76 manga series, featuring 11.5K exemplar character images in total, as well as a list of chapters in which they appear. The code, trained model, and both datasets can be found at: https://github.com/ragavsachdeva/magi
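As a rough illustration of the character bank's contents, an entry might look like the following. This schema is inferred only from the abstract's description (character identities, exemplar images, and chapter lists); the released dataset's actual format may differ, and all values are placeholders.

```python
# Illustrative character-bank record; placeholder values, hypothetical schema.
character_bank_entry = {
    "series": "<one of the 76 manga series>",
    "character": "<one of the 11K+ character names>",
    "exemplar_images": ["<path to an exemplar crop>"],  # 11.5K exemplar images in total
    "chapters": ["<chapters in which the character appears>"],
}
```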