Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Franz Louis Cesista

2024-06-24

Summary

This technical report presents Multimodal Structured Generation, the authors' entry to the 2nd Multimodal Foundation Models (MMFM) Challenge at CVPR. The method constrains how advanced models generate text so that they produce structured, machine-readable information, particularly for tasks like document understanding.

What's the problem?

Multimodal Foundation Models have been successful in many areas, but they struggle with specific tasks such as understanding documents. They also require a lot of computing power and time to fine-tune, making them less practical compared to simpler models that focus on one type of data, like just text or just images.

What's the solution?

The authors propose a framework called Multimodal Structured Generation, which constrains the output logits of frozen models so that they reason through the information first and then emit structured outputs that other applications can parse and use directly. In the challenge, this approach achieved the second highest score on Phase 2's hidden test set and the third highest score overall; a minimal sketch of the core idea follows below.
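
To make the constraint idea concrete, here is a minimal, hypothetical sketch of logit-constrained decoding in the style of Hugging Face's LogitsProcessor API. This is not the authors' actual implementation, and the token whitelist (`valid_token_ids`) is an assumption standing in for a constraint derived from a grammar or JSON schema:

```python
# A minimal sketch of constrained decoding, assuming Hugging Face
# `transformers` and PyTorch. The whitelist is hypothetical; real
# structured-generation frameworks derive the allowed tokens at each
# decoding step from a JSON schema or grammar.
import torch
from transformers import LogitsProcessor

class WhitelistLogitsProcessor(LogitsProcessor):
    """Masks every token not on a whitelist so the (frozen) model can
    only emit text that keeps the output format valid."""

    def __init__(self, valid_token_ids):
        self.valid_token_ids = valid_token_ids

    def __call__(self, input_ids, scores):
        # Assign -inf to disallowed tokens so softmax gives them zero
        # probability; the model's weights are never fine-tuned.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.valid_token_ids] = 0.0
        return scores + mask
```

Such a processor would be passed to `model.generate(..., logits_processor=LogitsProcessorList([...]))`; because only the decoding loop changes, no compute is spent on fine-tuning the model itself.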

Why it matters?

This research matters because it shows that with smart engineering and design, simpler methods can outperform more complex ones. It highlights the potential for improving document understanding tasks without needing extensive resources, making it easier for developers to implement effective solutions in real-world applications.

Abstract

Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and the third highest overall, which shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed at https://github.com/leloykun/MMFM-Challenge
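
To illustrate the "reason before responding" contract the abstract describes, here is a hedged sketch of what a constrained output and its downstream parsing might look like. The field names (`reasoning`, `answer`) and the example values are assumptions for illustration, not the challenge's actual schema:

```python
# Hypothetical example of a "reason, then respond" structured output.
# Field names and values are illustrative assumptions, not the paper's
# actual schema.
import json
from dataclasses import dataclass

@dataclass
class StructuredAnswer:
    reasoning: str  # free-form reasoning the model is forced to emit first
    answer: str     # machine-parseable value a downstream API consumes

# Because decoding is constrained to this shape, parsing never fails:
raw = '{"reasoning": "The total appears in the bottom-right cell.", "answer": "1024.00"}'
parsed = StructuredAnswer(**json.loads(raw))
print(parsed.answer)  # -> 1024.00
```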