Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
2025-12-09
Summary
This paper introduces a new method called Native Parallel Reasoner, or NPR, which helps large language models (LLMs) think through problems in a truly parallel way, like having multiple thoughts at once, instead of one after another.
What's the problem?
Large language models are good at many things, but when it comes to complex reasoning they typically process information step by step, following a single line of thought. This sequential processing is slow and limits how efficiently they can tackle complicated problems. Existing attempts to make them reason in parallel often aren't truly parallel: under the hood they still fall back to predicting the next step in a sequence, that is, ordinary autoregressive decoding.
What's the solution?
The researchers developed NPR, which uses a three-part approach to teach LLMs to reason in parallel without needing an external teacher. First, a self-distilled, progressive training process moves the model from loose "cold-start" exploration of the parallel format to strictly structured decomposition, so it gradually learns the best way to break a problem into branches. Second, a new algorithm, Parallel-Aware Policy Optimization (PAPO), lets the model experiment with different ways of branching its reasoning and learn, through trial and error, which decompositions work best. Finally, the team refactored the underlying serving system, SGLang, into an NPR Engine whose memory management and flow control are robust enough for stable, large-scale parallel reinforcement-learning training. A rough sketch of what this branch-then-merge style of execution could look like is given below.
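To make the branching idea concrete, here is a minimal, hypothetical Python sketch of branch-then-merge parallel reasoning. The function names (generate, parallel_reason), the decompose/solve/merge structure, and the stubbed model call are illustrative assumptions; they are not the paper's NPR pipeline or the SGLang API.

# Hypothetical sketch of branch-then-merge parallel reasoning. NOT the paper's
# NPR implementation or the SGLang API; the decompose / solve / merge steps and
# the `generate` stub are illustrative assumptions only.
import asyncio

async def generate(prompt: str) -> str:
    # Stand-in for a call to a serving engine such as SGLang; it just echoes
    # so the control flow below can be run end to end.
    await asyncio.sleep(0)
    return f"[model output for: {prompt[:40]}...]"

async def parallel_reason(problem: str) -> str:
    # 1) Ask the model to decompose the problem into independent sub-problems.
    plan = await generate(f"Decompose into independent sub-problems:\n{problem}")
    branches = [line for line in plan.splitlines() if line.strip()]

    # 2) Solve every branch concurrently rather than one after another.
    partials = await asyncio.gather(*(generate(f"Solve:\n{b}") for b in branches))

    # 3) Merge the partial results into one final answer.
    merged = "\n".join(partials)
    return await generate(f"Combine these partial results into a final answer "
                          f"for '{problem}':\n{merged}")

if __name__ == "__main__":
    print(asyncio.run(parallel_reason("Prove the sum of two even numbers is even.")))

In this toy version the concurrency comes from asyncio.gather; the paper's point is that the model itself learns when and how to branch, rather than following a fixed decompose/solve/merge script like this one.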
Why it matters?
This work is important because it shows a way to make LLMs significantly faster and more capable at reasoning. By achieving genuine parallel execution, NPR improves performance on reasoning tasks by up to 24.5% and speeds up inference by up to 4.6x. This could lead to more powerful AI systems that solve complex problems more efficiently, moving beyond simply mimicking sequential thought processes.
Abstract
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
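As a rough illustration of the kind of objective a parallel-aware policy optimizer might use, the sketch below computes group-relative advantages over a set of sampled branches and a clipped policy-gradient loss. It is a generic GRPO/PPO-style toy in PyTorch; the reward definition, grouping, and clipping here are assumptions for illustration, not the paper's actual PAPO objective or its execution-graph optimization.

# Toy, generic clipped policy-gradient update over reasoning branches.
# NOT the paper's PAPO: rewards, grouping, and clipping are standard
# placeholders used purely for illustration.
import torch

def branch_policy_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    # logp_new : (num_branches,) log-prob of each branch under the current policy
    # logp_old : (num_branches,) log-prob under the policy that sampled the branches
    # rewards  : (num_branches,) scalar reward per branch (e.g. answer correctness)

    # Group-relative advantage: compare each branch to its siblings.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Standard clipped importance-weighted policy gradient.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Example: 4 branches sampled for one problem, two of them correct.
logp_old = torch.tensor([-12.0, -15.0, -11.5, -14.0])
logp_new = logp_old + torch.randn(4) * 0.1   # pretend the policy moved slightly
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(branch_policy_loss(logp_new, logp_old, rewards))

The intuition matches the abstract's description at a high level: branches that lead to better outcomes are reinforced relative to their siblings, so the model learns, by trial and error, when and how to decompose a problem.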