Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

2025-12-19

Summary

This paper introduces a new method called AuditDM to systematically find and fix problems in multimodal Large Language Models, which are AI systems that can understand both text and images.

What's the problem?

Current methods for evaluating these AI models aren't very transparent, and they make it hard to pinpoint exactly *where* a model struggles. Existing tests don't reveal all of a model's weaknesses, leaving gaps in our understanding of its capabilities and hindering improvement. It's like trying to fix a car without knowing what's broken under the hood.

What's the solution?

AuditDM works by training another AI model to act as an 'auditor'. This auditor creates tricky questions and slightly altered images designed to make different AI models disagree with each other; where models disagree, there is likely a weakness. The auditor then turns these disagreements into examples of failures, and those examples are used to retrain the original models, making them better. Essentially, it uses AI to find flaws in other AI and then helps them improve.
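The core idea of rewarding the auditor for provoking disagreement can be sketched in a few lines. This is a minimal illustration, not the paper's actual reward: it assumes each target model returns a short answer string for the same probe and measures how far the answers are from unanimous. The function name and example answers are hypothetical.

```python
from collections import Counter

def disagreement_reward(answers):
    """Toy reward for an auditor-generated probe (question + image).

    `answers` holds one answer string per target model. The reward is
    0.0 when all models agree and grows as their answers diverge,
    which is the signal the auditor is trained to maximize.
    """
    counts = Counter(answers)
    majority_size = counts.most_common(1)[0][1]
    # Fraction of models that deviate from the majority answer.
    return 1.0 - majority_size / len(answers)

# Hypothetical probe sent to three target models:
answers = ["a cat", "a cat", "a dog"]
reward = disagreement_reward(answers)  # one model dissents -> positive reward
```

In the paper's framework, a reinforcement-learning loop would use a signal like this to steer the auditor toward questions and counterfactual images that expose capability gaps; the disagreeing cases then double as annotation-free training data for rectification.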

Why it matters?

This research shows that simply making AI models bigger doesn't always lead to better performance. Instead, carefully identifying and addressing specific weaknesses through targeted auditing is a more effective way to improve them. This matters because scaling up models and training data is becoming more expensive while yielding diminishing returns, and targeted auditing can allow smaller models to outperform much larger ones.

Abstract

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.