ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
2025-12-10
Summary
This paper introduces ThreadWeaver, a new system designed to make large language models (LLMs) think and respond faster without losing accuracy. It focuses on improving how LLMs solve complex problems by allowing them to work on different parts of the problem simultaneously.
What's the problem?
Large language models are good at reasoning, but they can be slow because they generate their solutions step by step, one token after another. Breaking a problem into many intermediate steps helps accuracy, but it adds a lot of latency, especially on difficult tasks. Existing attempts to speed things up by having the model explore several sub-problems at once either lose noticeable accuracy compared to standard sequential reasoning or require customized inference engines, making them hard to deploy in practice.
What's the solution?
ThreadWeaver tackles this problem with a three-part approach. First, a two-stage trajectory generator builds a large dataset of reasoning traces annotated with parallel structure, which is used to fine-tune the model so it learns how to split its thinking into concurrent threads. Second, a trie-based design packs those threads so they share common prefixes, letting the model run on any off-the-shelf autoregressive inference engine without changes to position embeddings or KV caches (see the sketch below). Finally, a parallelization-aware reinforcement learning stage teaches the model to balance getting the right answer against the latency savings of parallel work. Essentially, it learns *when* to think in parallel and *how* to do it effectively.
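The trie-based packing can be pictured with a toy example. The sketch below is a minimal illustration of the general idea, not ThreadWeaver's actual implementation: threads that branch after a shared opener are stored in a trie, each root-to-leaf path is decoded as an ordinary sequential request, and the shared prefix only needs to be processed once (for example via the prefix caching that standard inference engines already provide).

```python
# Illustrative sketch only: how parallel reasoning threads that share a
# common prefix can be organized as a trie, so each thread decodes as an
# ordinary sequential request and the shared prefix is computed once.
# This is an assumption about the general idea, not ThreadWeaver's code.
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict[str, "TrieNode"] = field(default_factory=dict)

def insert(root: TrieNode, tokens: list[str]) -> None:
    """Add one reasoning thread; shared prefixes reuse existing nodes."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())

def threads(node: TrieNode, prefix: list[str] | None = None) -> list[list[str]]:
    """Enumerate every root-to-leaf path, i.e., every full thread."""
    prefix = prefix or []
    if not node.children:
        return [prefix]
    paths: list[list[str]] = []
    for tok, child in node.children.items():
        paths.extend(threads(child, prefix + [tok]))
    return paths

# Two threads branch after a shared opener; the opener is stored (and,
# on a real engine, its KV cache computed) only once.
root = TrieNode()
insert(root, ["Split", "into", "cases:", "case", "A", "..."])
insert(root, ["Split", "into", "cases:", "case", "B", "..."])
for t in threads(root):
    print(" ".join(t))
```

Because each path is just a normal sequence, no custom attention masks or position-embedding tricks are needed, which is what lets this kind of approach run on unmodified engines.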
Why it matters?
This work matters because it shows how to significantly speed up large language model reasoning without sacrificing accuracy: ThreadWeaver matches strong sequential reasoning models of similar size while reducing token latency by up to 1.53x on average. That combination opens the door to using these models in more latency-sensitive, real-time applications and makes them more practical for solving complex problems quickly.
Abstract
Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
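To make the third innovation concrete, here is one hypothetical shape a parallelization-aware reward could take. The paper's actual objective is not reproduced here; the correctness bonus, the critical-path speedup proxy, and the weight `lam` below are all illustrative assumptions.

```python
# Hypothetical parallelization-aware reward, for intuition only: the
# paper's actual objective is not spelled out here. The correctness
# bonus, the critical-path speedup proxy, and the weight `lam` are all
# illustrative assumptions.
def reward(is_correct: bool, total_tokens: int,
           critical_path_tokens: int, lam: float = 0.1) -> float:
    """Score a finished trajectory.

    total_tokens:         tokens generated across all threads (compute).
    critical_path_tokens: tokens on the longest sequential thread (latency).
    """
    if not is_correct:
        return 0.0  # no credit for fast but wrong answers
    # Speedup proxy: 0 when decoding is fully sequential, approaching 1
    # when work is spread evenly across many parallel threads.
    speedup = 1.0 - critical_path_tokens / max(total_tokens, 1)
    return 1.0 + lam * speedup

print(reward(is_correct=True, total_tokens=1000, critical_path_tokens=400))
# -> 1.06: a correct answer with 60% of its tokens off the critical path
```

A reward of this shape only pays the parallelization bonus on correct answers, which is one simple way to keep the model from trading accuracy for speed.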