Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention
Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh
2026-02-06
Summary
This research investigates whether having a second AI model check and correct the outputs of a large language model (LLM) actually improves results in real-world use, and finds that it does not always help and can even make performance worse.
What's the problem?
We often assume that having an 'AI critic' review an LLM's work will improve its reliability, but it is not clear how these critics perform when actually deployed. The study found that even a critic that is highly accurate at spotting errors in offline tests (an AUROC of 0.94) can cause a severe drop in performance for some LLMs, up to 26 percentage points, while barely affecting others. This shows that knowing how accurate a critic is isn't enough to decide whether using it is a good idea.
What's the solution?
The researchers identified a 'disruption-recovery tradeoff': AI critics can recover outputs that were headed toward failure, but they can also disrupt outputs that would have succeeded on their own. To address this, they developed a quick pre-deployment test that runs the agent on a small pilot of 50 tasks and estimates whether adding the critic is likely to improve or worsen performance *before* it is fully deployed (a minimal sketch appears below). The test worked well, correctly anticipating performance drops on tasks the LLM usually solves and a modest improvement on tasks where the LLM often fails.
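The sketch below illustrates one way such a pilot test could be implemented; the function names, the 50-task pilot size, and the deploy/don't-deploy threshold are illustrative assumptions, not the paper's exact procedure.

```python
import random

def run_pilot(tasks, run_agent, run_agent_with_critic, n_pilot=50, seed=0):
    """Estimate whether a critic is likely to help or harm before deployment.

    `run_agent(task)` and `run_agent_with_critic(task)` are placeholders for
    your own agent harness; both are assumed to return True on task success.
    """
    random.seed(seed)
    pilot = random.sample(tasks, min(n_pilot, len(tasks)))

    recovered = disrupted = baseline_successes = 0
    for task in pilot:
        base_ok = run_agent(task)                 # trajectory without intervention
        critic_ok = run_agent_with_critic(task)   # trajectory with critic intervention
        baseline_successes += base_ok
        if not base_ok and critic_ok:
            recovered += 1                         # critic rescued a failing trajectory
        elif base_ok and not critic_ok:
            disrupted += 1                         # critic broke a succeeding trajectory

    n = len(pilot)
    net_pp = 100.0 * (recovered - disrupted) / n   # estimated change in success rate
    return {
        "baseline_success_rate": baseline_successes / n,
        "recovery_rate": recovered / n,
        "disruption_rate": disrupted / n,
        "estimated_net_change_pp": net_pp,
        "deploy_critic": net_pp > 0,               # simple illustrative decision rule
    }
```

In this sketch the decision reduces to whether the critic recovers more pilot trajectories than it disrupts; a real deployment decision would also account for sampling noise in a 50-task pilot.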
Why it matters?
The main takeaway is that it's crucial to figure out *when not* to use an AI critic. This framework helps prevent situations where adding a critic actually makes an LLM perform worse, which is especially important before deploying these models in real-world applications where reliability is key.
Abstract
Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while leaving another essentially unaffected (near 0 pp). This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
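One way to see why a highly accurate critic can still hurt is to decompose the expected change in success rate (the notation here is ours, not the paper's): with baseline success rate p, recovery rate r (the chance the critic rescues a failing trajectory), and disruption rate d (the chance it breaks a succeeding one), the expected change is roughly Δ ≈ (1 − p)·r − p·d. On tasks the model already solves often (large p), even a small disruption rate dominates, consistent with the 0 to -26 pp regressions; on high-failure tasks like ALFWorld (small p), the recovery term dominates, consistent with the +2.8 pp gain.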