MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang
2025-05-28

Summary
This paper examines how well advanced AI models can read and understand text that appears in videos, using a new test called MME-VideoOCR.
What's the problem?
The problem is that it's really hard for AI to accurately read words in videos because things like motion blur, changes over time, and special effects make the text hard to see and understand. These challenges also make it tough for the AI to connect what it reads with what's happening in the video.
What's the solution?
The researchers created the MME-VideoOCR benchmark to test how good these AI models are at reading and understanding text in videos. They found that the models still struggle, especially when the text moves or changes over time, and that the models sometimes guess answers from language patterns instead of actually reading the words in the video.
Why it matters?
This matters because accurately reading text in videos is important for things like searching videos for information and making video content accessible to people with disabilities. The research shows where AI still needs to improve before it can be reliable in real-world video situations.
Abstract
MLLMs achieve only modest accuracy in video OCR because of motion blur, temporal variations, and visual effects; the MME-VideoOCR benchmark reveals their limitations in spatio-temporal reasoning and their tendency toward language bias.