Efficiently Serving LLM Reasoning Programs with Certaindex

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang

2024-12-31

Summary

This paper introduces Dynasor, a new system that improves how large language models (LLMs) handle reasoning tasks by optimizing the use of computing resources during inference.

What's the problem?

As LLMs are used for complex reasoning tasks like solving math problems or generating code, they often require a lot of computing power and time. Current systems do not adapt well to the varying difficulty of different tasks, leading to inefficient use of resources and slower response times. This makes it hard to get quick and accurate answers, especially for more challenging queries.

What's the solution?

To solve this problem, the authors developed Dynasor, a serving system that tracks and manages how computing resources are allocated across reasoning tasks. It uses a signal called Certaindex, a proxy that measures how certain the model is about its reasoning progress. Guided by this signal, Dynasor allocates more computing power to difficult queries, less to easier ones, and can even stop unpromising queries early. This dynamic approach balances accuracy, speed, and cost during the reasoning process.
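The core idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not Dynasor's actual implementation: it approximates a Certaindex-style certainty signal as the agreement ratio among sampled answers (as in self-consistency reasoning), and uses it to stop sampling early once the model seems confident. The function names `certaindex` and `adaptive_reason` are invented for this sketch.

```python
from collections import Counter

def certaindex(answers):
    """Certainty proxy: fraction of sampled answers that agree
    on the most common result (1.0 = full consensus)."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

def adaptive_reason(sample_answer, max_samples=16, min_samples=4, threshold=0.8):
    """Draw reasoning samples until certainty crosses a threshold
    or the compute budget is spent.

    sample_answer: callable returning one candidate answer,
    standing in for one LLM reasoning path.
    Returns (majority answer, number of samples used).
    """
    answers = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        if len(answers) >= min_samples and certaindex(answers) >= threshold:
            break  # confident enough: stop early and save compute
    return Counter(answers).most_common(1)[0][0], len(answers)

# Deterministic toy demo: an "easy" query whose paths all agree stops
# after the minimum number of samples; a "hard" query with scattered
# answers never reaches consensus and spends the full budget.
consistent = iter(["42"] * 16)
scattered = iter(["1", "2", "3", "4"] * 4)
ans_easy, used_easy = adaptive_reason(lambda: next(consistent))
ans_hard, used_hard = adaptive_reason(lambda: next(scattered))
print(ans_easy, used_easy)  # "42" after only min_samples draws
print(ans_hard, used_hard)  # no consensus: all 16 samples used
```

In a real serving system this decision would run per scheduling step across many concurrent queries, so compute freed by early-stopping easy queries can be reassigned to hard ones.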

Why it matters?

This research is important because it enhances the efficiency of AI systems that rely on LLMs for reasoning tasks. By optimizing how these models use computing resources, Dynasor can provide faster responses and better performance across various applications, making AI tools more effective in areas like education, programming, and data analysis.

Abstract

The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.