Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

2025-12-10

Summary

This paper introduces a new method called Arbitrage to speed up how large language models, like those used for complex reasoning, generate answers. It focuses on making the process more efficient without sacrificing accuracy.

What's the problem?

Large language models are really good at thinking through problems step-by-step, but this takes a lot of computing power and time. A common technique to speed things up involves using a faster, less accurate model to propose answers, and then having a more powerful model check them. However, the faster model's proposals often get rejected even when its reasoning steps are logically correct, simply because the exact tokens differ, and the system wastes effort redoing those steps. Even newer methods that check entire reasoning steps instead of individual tokens still regenerate many rejected steps that end up little better than the originals.
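To make the rejection problem concrete, here is a minimal sketch of token-level speculative verification (an illustration of the general technique, not the paper's implementation). The draft model proposes a sequence of tokens, and a draft token is kept only if it matches what the target model would have produced at that position:

```python
# Illustrative sketch of token-level speculative verification.
# A draft token is accepted only if it matches the target's own
# choice at that position; the first mismatch discards the rest.

def verify_token_level(draft_tokens, target_tokens):
    """Return the accepted prefix and the number of rejected draft tokens."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # first mismatch: all remaining draft tokens are discarded
        accepted.append(d)
    return accepted, len(draft_tokens) - len(accepted)

# Two semantically equivalent phrasings of the same reasoning step:
draft  = ["thus", "x", "equals", "4"]
target = ["so",   "x", "equals", "4"]

accepted, rejected = verify_token_level(draft, target)
# The step is rejected at the very first token despite identical meaning,
# so the target must regenerate the whole step from scratch.
```

Here `accepted` is empty and all four draft tokens are rejected, even though both phrasings express the same correct step — exactly the kind of wasted compute the paper targets.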

What's the solution?

The researchers developed Arbitrage, a system that intelligently decides when to trust the faster model and when to rely on the more powerful one. Instead of a simple 'yes' or 'no' check, Arbitrage uses a small 'router' model to predict if the powerful model is likely to come up with a significantly better step. This allows the system to dynamically choose the best step, minimizing wasted computation and maximizing efficiency. It's like having a judge who understands when a quick decision is good enough and when a more careful review is needed.
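The routing idea can be sketched in a few lines (a hypothetical simplification — the function names, the scalar "advantage" score, and the threshold are assumptions, not the paper's actual router): a cheap router scores how much better the target's step is likely to be, and target compute is spent only when that predicted advantage is large enough.

```python
# Hypothetical sketch of advantage-aware routing. The router's scalar
# score and the 0.5 threshold are illustrative assumptions; in the paper
# a lightweight trained router approximates an ideal "Arbitrage Oracle"
# that always picks the higher-quality step.

def route_step(draft_step, predicted_advantage, target_generate,
               threshold=0.5):
    """Keep the draft step unless the router predicts the target
    model would produce a meaningfully better one."""
    if predicted_advantage > threshold:
        return target_generate(), "target"  # spend target compute
    return draft_step, "draft"              # draft is good enough

# Router predicts little to gain: the draft step is kept, no target call.
step, source = route_step("x = 4", predicted_advantage=0.1,
                          target_generate=lambda: "x equals 4")

# Router predicts a large advantage: the target regenerates the step.
step2, source2 = route_step("x = 5", predicted_advantage=0.9,
                            target_generate=lambda: "x = 4")
```

The key contrast with a fixed accept/reject threshold is that the decision is relative: the target is invoked only when it is expected to do meaningfully better, not whenever the draft merely differs from it.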

Why it matters?

This work is important because it makes complex reasoning tasks with large language models much faster and more practical. By reducing the time it takes to get an answer without losing accuracy, Arbitrage opens the door to using these powerful models in more real-time applications and on less powerful hardware, making them more accessible and useful.

Abstract

Modern Large Language Models achieve impressive reasoning capabilities with long Chains of Thought, but they incur substantial computational cost during inference, which motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to ~2× at matched accuracy.