Great Models Think Alike and this Undermines AI Oversight

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

2025-02-07

Summary

This paper examines how the increasing similarity between advanced AI language models can undermine the use of AI to oversee other AI systems, a practice the authors call 'AI Oversight'.

What's the problem?

As AI language models get smarter, it's becoming harder for humans to check their work and supervise them. People hope that we can use other AI models to do this job, but there's a catch: when AI models become very advanced, they start to make similar mistakes. This similarity can lead to biased judgments and correlated failures, which makes it risky to rely on AI to oversee other AI systems.

What's the solution?

The researchers created a way to measure how similar different AI models are by looking at the overlap in the mistakes they make. Using this metric, they showed that when an AI model judges another AI model, it tends to favor models that are similar to itself. They also found that training one AI on annotations produced by another works best when the 'teacher' AI and the 'student' AI have complementary strengths. However, as AI models become more capable, their mistakes are becoming more alike, which could cause problems for AI Oversight.
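The paper's metric is probabilistic, but its core idea is agreement on mistakes, corrected for the agreement that the two models' accuracy levels alone would predict. That idea can be sketched with a chance-adjusted score in the style of Cohen's kappa. This is an illustrative approximation, not the authors' exact formulation; the function name and the binary correct/incorrect framing are assumptions.

```python
import numpy as np

def error_similarity(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Chance-adjusted agreement between two models' per-question outcomes.

    correct_a, correct_b: boolean arrays, True where each model answered a
    benchmark question correctly. Modeled on Cohen's kappa: observed
    agreement minus the agreement expected from the models' accuracies
    alone, so two models only count as 'similar' if they err on the same
    questions more often than their accuracy levels would predict.
    """
    a = np.asarray(correct_a, bool)
    b = np.asarray(correct_b, bool)
    observed = np.mean(a == b)                        # raw agreement rate
    p_a, p_b = a.mean(), b.mean()                     # each model's accuracy
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)      # agreement by chance
    return (observed - expected) / (1 - expected)     # kappa-style score

# Example: two 80%-accurate models that miss the SAME questions score
# near 1; two that miss disjoint questions would score below 0.
rng = np.random.default_rng(0)
shared_errors = rng.random(1000) < 0.2
model_a = ~shared_errors
model_b = ~shared_errors
print(error_similarity(model_a, model_b))  # ~1.0: identical mistakes
```

The chance-adjustment step is what makes this a similarity measure rather than a capability measure: two strong models agree on most questions simply because both are usually right, and subtracting the expected agreement removes that effect.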

Why does it matter?

This research matters because as we rely more on AI systems, we need ways to make sure they're working correctly. If we can't trust AI to oversee other AI because they're too similar, it could lead to uncaught mistakes or biased decisions in important areas. This study helps us understand the risks and limitations of using AI for oversight, which is crucial for developing safer and more reliable AI systems in the future.

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
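The abstract's finding that LLM-as-a-judge scores favor models similar to the judge suggests a simple diagnostic: check whether the score a judge assigns a model, beyond that model's quality on an independent reference, correlates with the model's similarity to the judge. The sketch below is a hypothetical test of this kind, not the authors' protocol; the reference scores and the correlation-based framing are assumptions.

```python
import numpy as np

def judge_affinity_bias(similarity_to_judge, judge_scores, reference_scores):
    """Correlate each model's similarity to the judge with the excess
    score the judge assigns beyond an independent reference.

    A positive correlation suggests the judge systematically inflates
    models whose mistakes resemble its own.
    """
    sim = np.asarray(similarity_to_judge, float)
    excess = np.asarray(judge_scores, float) - np.asarray(reference_scores, float)
    return np.corrcoef(sim, excess)[0, 1]

# Hypothetical usage: five models evaluated by one judge.
sim = [0.9, 0.7, 0.5, 0.3, 0.1]         # similarity to the judge model
judged = [8.6, 8.1, 7.4, 7.0, 6.5]      # judge-assigned scores
gold = [7.8, 7.9, 7.3, 7.2, 6.9]        # reference (e.g. human) scores
print(judge_affinity_bias(sim, judged, gold))  # positive => affinity bias
```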