RDMA Point-to-Point Communication for LLM Systems
Nandor Licker, Kevin Hu, Vladimir Zaytsev, Lequn Chen
2025-11-07
Summary
This paper introduces TransferEngine, a new system that speeds up communication between the different parts of large AI models, especially models that are split up and run across multiple machines.
What's the problem?
Large language models have grown so big that they must be broken into pieces and run across many machines, which demands fast, flexible communication between those machines. Existing communication methods, however, are often tied to specific hardware, such as a particular type of network card. This makes it hard to move models between different computer systems and limits how well they work with existing AI software.
What's the solution?
TransferEngine solves this by acting as a universal 'translator' for different network cards: software talks to the network hardware in one standard way, regardless of the specific brand or type. Under the hood, it uses a fast one-sided operation called 'WriteImm' together with a counter that lets the receiving machine know when all the data has arrived, without assuming the network delivers data in order. This lets it drive multiple network cards per GPU efficiently and reach speeds of up to 400 gigabits per second.
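To make the idea concrete, here is a minimal Python sketch of the pattern described above: one-sided writes that each carry an immediate value, with a receiver-side counter that signals completion once the expected number of writes has landed, in any order. The class and method names (`ImmCounter`, `FakeNic`, `write_imm`) are hypothetical illustrations, not TransferEngine's actual API, and a real implementation would use RDMA hardware rather than in-process memory copies.

```python
import threading
from collections import defaultdict

class ImmCounter:
    """Counts immediate values as they arrive. A transfer is complete when
    the count for its immediate reaches the expected number of writes;
    no assumption is made about the order in which writes arrive."""
    def __init__(self):
        self._counts = defaultdict(int)
        self._cond = threading.Condition()

    def bump(self, imm):
        with self._cond:
            self._counts[imm] += 1
            self._cond.notify_all()

    def wait(self, imm, expected):
        with self._cond:
            self._cond.wait_for(lambda: self._counts[imm] >= expected)

class FakeNic:
    """Hypothetical stand-in for one network card: performs a one-sided
    write into remote memory and bumps the remote ImmCounter."""
    def __init__(self, remote_mem, remote_counter):
        self.remote_mem = remote_mem
        self.remote_counter = remote_counter

    def write_imm(self, offset, data, imm):
        # One-sided write: the receiver's CPU is not involved in the copy.
        self.remote_mem[offset:offset + len(data)] = data
        # The immediate value signals completion on the receiver side.
        self.remote_counter.bump(imm)

# Split one 4 KiB transfer across two "NICs"; the receiver waits on the
# counter, so completion works no matter which chunk lands first.
mem = bytearray(4096)
counter = ImmCounter()
nics = [FakeNic(mem, counter), FakeNic(mem, counter)]
chunks = [(0, b"a" * 2048), (2048, b"b" * 2048)]
for nic, (off, data) in zip(nics, chunks):
    nic.write_imm(off, data, imm=7)
counter.wait(imm=7, expected=len(chunks))
```

The key design point this sketch mirrors is that correctness depends only on counting arrivals, not on delivery order, which is what allows the same interface to sit on top of transports with different ordering guarantees.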
Why it matters?
This work matters because it makes it easier to build and deploy extremely large AI models. By removing the dependency on specific hardware, TransferEngine makes these models more portable and lets them use the full available network bandwidth. The paper demonstrates this with real production systems, such as fast weight updates during reinforcement learning and efficient serving of models split across multiple machines, showing it can match and even surpass existing hardware-specific solutions.
Abstract
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.