DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy
2025-04-11

Summary
This paper studies how the DeepSeek-R1 AI ‘thinks’ by analyzing its step-by-step reasoning process, much like watching someone solve a puzzle out loud, and uncovers both strengths and weaknesses in its problem-solving approach.
What's the problem?
While DeepSeek-R1 shows advanced reasoning by breaking problems into steps, it sometimes overthinks (which hurts performance), gets stuck ruminating on previously explored ideas, and carries greater safety risks than its non-reasoning counterpart.
What's the solution?
Researchers introduce ‘Thoughtology’ to map how the model reasons, identify where it struggles (such as overthinking or repeating itself), and test how cultural biases or unsafe content can surface in its answers.
Why does it matter?
Understanding how AI reasons helps improve its reliability for tasks like tutoring or research, while addressing risks like biased decisions or unsafe suggestions that could harm users.
Abstract
Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
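For readers who want to inspect the reasoning chains themselves, the minimal sketch below shows one simple way to separate DeepSeek-R1's publicly visible thought from its final answer and get a rough sense of thought length. It assumes the standard `<think>...</think>` output convention; the whitespace token count is only an illustrative proxy, not the measurement used in the paper.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a DeepSeek-R1 style response into (reasoning chain, final answer).

    Assumes the reasoning chain is wrapped in <think>...</think> tags,
    as in DeepSeek-R1's public output format.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No reasoning chain found; treat the whole output as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

# Toy example (hypothetical model output, for illustration only).
response = (
    "<think>The user asks for 12 * 7. 12 * 7 = 84.</think>"
    "The answer is 84."
)
reasoning, answer = split_reasoning(response)
print(len(reasoning.split()), "whitespace-delimited reasoning tokens")
print("Answer:", answer)
```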