PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Rohit Saxena, Pasquale Minervini, Frank Keller
2025-02-27
Summary
This paper introduces PosterSum, a new benchmark for testing and improving how well AI can understand and summarize scientific posters. The researchers collected a large set of scientific posters paired with summaries to help develop better AI systems for this task.
What's the problem?
Scientific posters are complex documents with text, images, tables, and graphs all mixed together. Current AI systems, even advanced ones that can work with both text and images, have trouble accurately understanding and summarizing these posters. This is a problem because being able to quickly understand and summarize scientific information is important for researchers and students.
What's the solution?
The researchers created PosterSum, a dataset of 16,305 scientific posters paired with the abstracts of their corresponding papers as summaries. They used this dataset to test how well current AI systems can summarize posters. They also developed a new method called 'Segment & Summarize' that breaks a poster down into smaller parts before summarizing it. This method outperformed existing AI systems on automated summarization metrics.
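The segment-then-summarize idea can be sketched as a simple pipeline: split the poster image into regions, describe each region, then fuse those descriptions into one summary. The sketch below is only illustrative, not the paper's actual implementation: the grid segmentation, the helper names (`segment_poster`, `summarize_region`), and the final joining step are all assumptions standing in for the paper's segmentation and MLLM calls.

```python
# Hypothetical sketch of a "segment, then summarize" pipeline.
# The real method would call a multimodal model per region; here the
# model calls are stubbed out so the control flow is visible.

def segment_poster(width, height, rows=2, cols=2):
    """Split a poster of the given pixel size into a grid of region boxes
    (left, top, right, bottom). A real system would segment by layout."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes

def summarize_region(box):
    # Placeholder for an MLLM call that describes one poster region.
    return f"summary of region {box}"

def segment_and_summarize(width, height):
    # Summarize each region locally, then fuse the local summaries.
    # A text-only LLM would do the fusion; joining is a placeholder.
    local = [summarize_region(b) for b in segment_poster(width, height)]
    return " ".join(local)
```

The key design point is hierarchy: dense poster regions are handled one at a time, so no single model call has to read the entire cluttered layout at once.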
Why it matters?
This research matters because it helps push forward the development of AI that can better understand complex scientific information. Being able to quickly and accurately summarize scientific posters could help researchers stay up-to-date with new findings more easily. It could also help students learn from scientific materials more effectively. As AI becomes more involved in scientific research and education, improving its ability to understand and summarize complex information becomes increasingly important.
Abstract
Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. PosterSum will serve as a starting point for future research on poster summarization.