DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
2026-02-26
Summary
This paper focuses on speeding up how large language models (LLMs) process information over multiple turns, like in a chatbot conversation.
What's the problem?
When LLMs have a conversation, they need to remember what was said before. This 'memory' is stored in something called the KV-Cache, and it gets really big, really fast. Current systems struggle because loading this huge KV-Cache from storage to the computers doing the work creates a bottleneck. Specifically, the computers that *start* the conversation (prefill engines) get overloaded trying to grab the KV-Cache, while the computers that *continue* the conversation (decoding engines) sit around doing nothing, wasting potential processing power.
What's the solution?
The researchers created a system called DualPath that fixes this by adding a second way to load the KV-Cache. Instead of *only* sending the KV-Cache to the computers starting the conversation, DualPath also sends it directly to the computers continuing the conversation. These computers can then quickly share the necessary parts with the starting computers using a fast connection. They also developed a smart system to balance the workload between the starting and continuing computers.
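The core idea above can be sketched in a few lines of Python. This is a hypothetical illustration (all names and numbers are invented, not from the paper): a scheduler checks which storage NIC has spare bandwidth and routes each KV-Cache load either directly to a prefill engine, or through a decoding engine that forwards it over the fast compute network.

```python
# Hypothetical sketch of DualPath's path-selection idea (names invented).
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    nic_capacity_gbps: float    # storage NIC bandwidth
    nic_load_gbps: float = 0.0  # bandwidth currently in use

    def spare(self) -> float:
        return self.nic_capacity_gbps - self.nic_load_gbps

def choose_path(prefill: Engine, decode: Engine, size_gbits: float) -> str:
    """Pick the loading path whose storage NIC has more spare bandwidth.

    'storage->prefill' is the traditional path; 'storage->decode' loads
    the KV-Cache into the decoding engine, which then forwards it to the
    prefill engine over the fast compute-network (RDMA) link.
    """
    if prefill.spare() >= decode.spare():
        prefill.nic_load_gbps += size_gbits
        return "storage->prefill"
    decode.nic_load_gbps += size_gbits
    return "storage->decode"

# When the prefill NIC is nearly saturated, the load is diverted
# through the otherwise-idle decoding engine.
p = Engine("prefill-0", nic_capacity_gbps=100, nic_load_gbps=95)
d = Engine("decode-0", nic_capacity_gbps=100, nic_load_gbps=10)
print(choose_path(p, d, size_gbits=20))  # -> storage->decode
```

The real system's scheduler is far more sophisticated (it balances load globally and avoids interfering with latency-critical traffic), but the sketch captures why a second path helps: bandwidth that would otherwise sit idle on the decoding side absorbs the overflow.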
Why it matters?
This is important because it significantly speeds up LLM performance, making chatbots and other conversational AI much more responsive. The paper shows up to 1.87 times higher throughput for offline (batch) processing and an average of 1.96 times higher throughput for online, real-time serving, without sacrificing quality or reliability. This means better user experiences and the ability to handle more users at once.
Abstract
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87x on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96x without violating SLO.