InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

2025-02-24

Summary

This paper talks about InterFeedback, a new way to test how well AI models that can understand different types of information (like text, images, and video) can interact with humans and learn from their feedback.

What's the problem?

Current tests for AI models don't check how well they can have back-and-forth conversations with humans and improve based on feedback. This is important for creating AI assistants that can truly help people in everyday situations.

What's the solution?

The researchers created InterFeedback, a framework that can test any AI model's ability to interact with humans and use their feedback. They also built two evaluation sets: InterFeedback-Bench, which repurposes existing hard problem sets (MMMU-Pro and MathVerse) to test 10 different open-source AI models, and InterFeedback-Human, a newly collected set of 120 examples for manually testing the most advanced models like OpenAI-o1 and Claude-3.5-Sonnet.

Why it matters?

This matters because it shows that even the best AI models today aren't very good at understanding and using human feedback to get better. The study found that top models like OpenAI-o1 could only improve their answers based on feedback less than half the time. This research helps us see where AI needs to improve to become truly helpful assistants that can learn and adapt through conversations with humans.

Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) correct their results through human feedback less than 50% of the time. Our findings point to the need for methods that can enhance the LMMs' capability to interpret and benefit from feedback.
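
To make the feedback-loop idea concrete, below is a minimal sketch of how such an interactive evaluation could be scripted. This is not the authors' code: the `query_lmm` helper, the feedback wording, and the single retry round are assumptions made for illustration; the paper's actual protocol may differ.

```python
# Minimal sketch of an interactive feedback-evaluation loop in the spirit of
# InterFeedback. `query_lmm`, the feedback wording, and the single-round
# protocol are illustrative assumptions, not the paper's exact implementation.

from dataclasses import dataclass


@dataclass
class Case:
    question: str  # problem text (plus an attached image in practice)
    answer: str    # ground-truth answer used to judge correctness


def query_lmm(prompt: str) -> str:
    """Placeholder for a call to the model under test (hypothetical)."""
    raise NotImplementedError("wire this to your LMM API")


def evaluate_with_feedback(cases: list[Case]) -> float:
    """Return the fraction of initially wrong answers fixed after one feedback round."""
    wrong, corrected = 0, 0
    for case in cases:
        first = query_lmm(case.question)
        if first.strip() == case.answer:
            continue  # already correct; no feedback round needed
        wrong += 1
        feedback_prompt = (
            f"{case.question}\n"
            f"Your previous answer was: {first}\n"
            "That answer is incorrect. Please reconsider and answer again."
        )
        second = query_lmm(feedback_prompt)
        if second.strip() == case.answer:
            corrected += 1
    return corrected / wrong if wrong else 1.0
```

Under this kind of setup, the "less than 50%" figure in the abstract corresponds to the value returned by `evaluate_with_feedback`: out of the answers a model initially gets wrong, fewer than half are corrected after receiving feedback.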