CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

2025-12-03

Summary

This paper introduces CUDA-L2, a system that combines large language models with reinforcement learning to automatically optimize the CUDA code used for matrix multiplication, making it run faster than even heavily tuned existing libraries.

What's the problem?

Matrix multiplication is a fundamental operation in many computer programs, especially those dealing with things like artificial intelligence and scientific computing. Getting it to run quickly is crucial, and traditionally, this requires expert programmers to carefully tune the code for specific hardware. This tuning process is incredibly time-consuming and difficult, even for experts, and existing highly-optimized libraries aren't always the best for every situation.

What's the solution?

The researchers created CUDA-L2, which essentially lets a computer program 'learn' the best way to write matrix multiplication code. They used a large language model to help guide a reinforcement learning algorithm. The algorithm tries out different code configurations, and the language model helps it explore promising options. The 'reward' for the algorithm is simply how fast the resulting code runs. By repeatedly trying different things and learning from the results, CUDA-L2 automatically finds code that outperforms existing libraries like torch.matmul, cuBLAS, and cuBLASLt, even in realistic scenarios where tasks are run at random times.
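The loop described above can be sketched in miniature. This is an illustrative toy only: the real system uses an LLM to propose CUDA kernel variants and measures actual GPU runtime, whereas here the tunable parameters, the candidate proposer, and the timing function are all made-up stand-ins.

```python
import random

# Hypothetical tunable kernel parameters (illustrative, not from the paper).
TILE_SIZES = [32, 64, 128]
PIPELINE_STAGES = [2, 3, 4]

def measure_runtime(config):
    """Stand-in for compiling a kernel variant and timing it on the GPU.
    Returns a synthetic runtime in milliseconds (lower is better)."""
    tile, stages = config
    return abs(tile - 64) * 0.01 + abs(stages - 3) * 0.5 + 1.0

def search(num_trials=200, seed=0):
    """Try candidate configurations and keep the fastest one found.
    In CUDA-L2 an LLM steers which candidates to try; here we sample
    uniformly at random to keep the sketch self-contained."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(num_trials):
        config = (rng.choice(TILE_SIZES), rng.choice(PIPELINE_STAGES))
        runtime = measure_runtime(config)  # the RL "reward" is speed
        if runtime < best_time:
            best_config, best_time = config, runtime
    return best_config, best_time

best_config, best_time = search()
```

The key design point survives the simplification: the only feedback signal is measured execution time, so the search needs no hand-written performance model of the hardware.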

Why it matters?

This work is important because it shows that even highly-optimized code, like that used for matrix multiplication, can still be improved using AI. It suggests that we can automate the process of code optimization, potentially leading to significant performance gains in many applications without needing constant manual intervention from human programmers. It demonstrates a new way to leverage the capabilities of large language models beyond just text generation, applying them to the complex task of software optimization.

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to Nvidia's state-of-the-art closed-source libraries, cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
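The two evaluation protocols the abstract distinguishes can be sketched as follows. This is a hedged illustration, not the paper's harness: the dummy workload stands in for a real HGEMM kernel launch, and the gap distribution is invented.

```python
import random
import time

def dummy_kernel():
    """Stand-in for launching a matmul kernel; any fixed workload will do."""
    s = 0
    for i in range(10_000):
        s += i * i
    return s

def bench_offline(n=20):
    """Offline mode: back-to-back launches with no gaps between calls."""
    start = time.perf_counter()
    for _ in range(n):
        dummy_kernel()
    return (time.perf_counter() - start) / n

def bench_server(n=20, seed=0):
    """Server mode: random idle intervals between launches, mimicking
    requests that arrive at unpredictable times during inference."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        time.sleep(rng.uniform(0.0, 0.002))  # simulated request gap
        start = time.perf_counter()
        dummy_kernel()
        total += time.perf_counter() - start
    return total / n

offline_avg = bench_offline()
server_avg = bench_server()
```

A reported speedup such as +22.0% then corresponds to baseline_time / candidate_time - 1 under whichever protocol is being measured; server mode matters because caches and clocks behave differently when launches are not back-to-back.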