MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
2025-05-29
Summary
This paper introduces new tools and tests designed to help computers better understand manga, Japanese comics that mix pictures and text to tell stories. The researchers created specialized benchmarks and a new model to measure how well AI can read and make sense of manga.
What's the problem?
Most AI systems are not very good at understanding manga because it combines images, speech bubbles, and unique storytelling styles. General-purpose models often struggle to connect the text with the right parts of the pictures or to follow the story as a whole.
What's the solution?
To address this, the researchers built two benchmarks, MangaOCR and MangaVQA, which test how well an AI can read the text in manga and answer questions about it. They also developed a specialized AI model called MangaLMM, trained to handle manga's unique mix of images and words, making it better at understanding the stories and characters.
Why it matters?
This work matters because it improves AI's ability to understand and interact with comics and graphic novels, opening the door to smarter reading assistants, better translation tools, and new ways for people all over the world to enjoy manga.
Abstract
Two new benchmarks, MangaOCR and MangaVQA, and a specialized model, MangaLMM, are introduced to evaluate and advance large multimodal models in understanding manga narratives.