Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
2025-02-03

Summary
This paper examines how giving AI models more time to think at test time (inference time) can make them better at resisting adversarial attacks designed to fool them. The researchers focused on two reasoning models, OpenAI o1-preview and o1-mini.
What's the problem?
AI models can be tricked by what are called 'adversarial attacks': carefully crafted inputs designed to make the model give wrong or harmful answers. The usual defense is adversarial training, which means retraining the model on examples of the attacks; this is complicated, time-consuming, and typically tailored to specific kinds of attacks.
What's the solution?
Instead of using special training, the researchers tried something simpler: they gave the models more compute to spend on reasoning when answering questions at test time. They found that in many cases, the more the model was allowed to reason, the less often the attacks succeeded. Notably, the models were not adversarially trained for these tasks and were not told anything about the attacks; simply allowing more reasoning was often enough to make them more resistant.
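To make the experimental idea concrete, here is a minimal sketch of how one might measure attack success rate as a function of inference-time compute. The functions `query_model` and `is_attack_successful`, the prompts, and the compute levels are hypothetical placeholders, not the paper's actual harness or any real API; the sketch only mirrors the structure of the experiment.

```python
# Sketch: estimate how often adversarial prompts succeed as the model is
# allowed to spend more inference-time compute on reasoning.
# `query_model` and `is_attack_successful` are hypothetical stand-ins for a
# real model call and a real grader; they are not the paper's code.

from typing import Callable


def attack_success_rate(
    adversarial_prompts: list[str],
    compute_level: int,
    query_model: Callable[[str, int], str],
    is_attack_successful: Callable[[str, str], bool],
    samples_per_prompt: int = 8,
) -> float:
    """Fraction of model samples on which the attack achieves its goal."""
    successes, total = 0, 0
    for prompt in adversarial_prompts:
        for _ in range(samples_per_prompt):
            # More compute = the model reasons for longer before answering.
            answer = query_model(prompt, compute_level)
            successes += is_attack_successful(prompt, answer)
            total += 1
    return successes / total


if __name__ == "__main__":
    prompts = ["<adversarial prompt 1>", "<adversarial prompt 2>"]  # placeholder data

    # Toy stand-ins so the sketch runs end to end; replace with a real model and grader.
    def query_model(prompt: str, compute_level: int) -> str:
        return "refusal" if compute_level >= 4 else "compliance"

    def is_attack_successful(prompt: str, answer: str) -> bool:
        return answer == "compliance"

    # Sweep over increasing amounts of inference-time compute; the paper's central
    # finding is that for many attacks this curve tends toward zero.
    for level in (1, 2, 4, 8, 16):  # arbitrary units of reasoning compute
        rate = attack_success_rate(prompts, level, query_model, is_attack_successful)
        print(f"compute level {level}: attack success rate {rate:.2%}")
```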
Why it matters?
This matters because it suggests a potentially easier way to make AI systems more reliable and harder to trick. Instead of always needing complex training methods, sometimes letting the model reason longer could be enough to protect against many types of attacks, which could make trustworthy AI systems simpler and cheaper to build. However, the researchers also found settings where more reasoning does not help, and they explore new attacks aimed specifically at reasoning models, so there is still more to learn about making AI robust against every kind of attack.
Abstract
We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.