Reasoning Shift: How Context Silently Shortens LLM Reasoning
Gleb Rodionov
2026-04-02
Summary
This paper investigates whether large language models (LLMs) maintain consistent reasoning behavior when faced with distractions or more complex contexts, even in cases where they still arrive at the right answer.
What's the problem?
LLMs are getting really good at complex reasoning, like working through long problems step by step and checking their own work. However, it's unclear whether this reasoning stays consistent when a model is given extra, irrelevant information, asked to handle several tasks in one conversation, or given a problem that is embedded in a bigger task. The core question is whether models simplify their reasoning process when the context gets more complicated, and whether that simplification hurts their ability to solve harder problems.
What's the solution?
Researchers tested several LLMs in three scenarios: problems padded with large amounts of irrelevant text, multiple independent tasks handled within one conversation, and problems presented as subtasks of a larger, more complex task. Across all three, the models produced significantly shorter reasoning traces than when the same problem was presented on its own, skipping details they would normally include. The shortened traces also contained less self-checking and less explicit management of uncertainty. The researchers then analyzed how much the reasoning changed in each scenario.
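The core measurement here is simple: compare the length of a model's reasoning trace for a problem in isolation versus the same problem under an altered context. A minimal sketch of that comparison is below; the trace strings are placeholders (in practice each would be a model's chain-of-thought output), and the helper names are illustrative, not the paper's code.

```python
def trace_length(trace: str) -> int:
    """Approximate trace length in whitespace-separated tokens."""
    return len(trace.split())

def retention_ratio(baseline: str, variant: str) -> float:
    """Fraction of the isolated-problem trace retained under a new condition."""
    return trace_length(variant) / trace_length(baseline)

# Placeholder traces for one problem: isolated vs. padded with irrelevant context.
isolated = "Let x = 3. Then 3 * 2 = 6. Verify: 6 / 2 = 3, consistent. Answer: 6."
with_irrelevant_context = "x = 3, so 3 * 2 = 6. Answer: 6."

ratio = retention_ratio(isolated, with_irrelevant_context)
print(f"trace retained: {ratio:.0%}")  # a shorter trace gives a ratio below 1
```

The paper's "up to 50% shorter" finding corresponds to a retention ratio around 0.5 under this kind of measurement, though the authors presumably count model tokens rather than whitespace-separated words.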
Why it matters?
This research highlights that while LLMs can often still *solve* problems even with distractions, their *way* of solving them can change, potentially making them less reliable when facing truly difficult challenges. It emphasizes the need to focus not just on getting the right answer, but also on understanding how robust and consistent an LLM's reasoning process is, and how well it manages information in complex environments. This is important for building trustworthy AI agents.
Abstract
Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
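The abstract's finer-grained analysis ties trace compression to a drop in self-verification behaviors such as double-checking. A crude way to approximate that kind of analysis is to count verification-style phrases in each trace; the marker list below is a hypothetical illustration, not the paper's actual annotation scheme, and the traces are placeholders.

```python
# Hypothetical list of phrases associated with self-verification behavior.
VERIFICATION_MARKERS = ("double-check", "let me verify", "wait,")

def verification_count(trace: str) -> int:
    """Count verification-style phrases in a reasoning trace (case-insensitive)."""
    lower = trace.lower()
    return sum(lower.count(marker) for marker in VERIFICATION_MARKERS)

# Placeholder traces: the isolated-problem trace self-checks; the compressed
# trace produced under extra context does not.
isolated_trace = "x = 6. Wait, let me verify: 3 * 2 = 6. Double-check: yes, 6."
compressed_trace = "3 * 2 = 6, so the answer is 6."

print(verification_count(isolated_trace), verification_count(compressed_trace))
```

A real analysis would need a validated marker set (or human annotation) rather than substring matching, but the comparison direction, fewer verification behaviors in compressed traces, is the pattern the abstract describes.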