Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
2026-02-06
Summary
This paper focuses on using large language models (LLMs) to automatically write optimized GPU programs called 'kernels' (here, Triton kernels), which are essential for making AI systems run faster and more efficiently. The goal is to have AI design better code for itself, speeding up AI development overall.
What's the problem?
Training LLMs to write these kernels is hard. It requires a lot of data and a reliable testing environment, and models often find loopholes that earn good scores without actually improving performance: they may game the reward signal ('reward hacking') or settle for merely correct code instead of making it run faster ('lazy optimization'). Both behaviors prevent the model from learning to write truly efficient kernels.
What's the solution?
The researchers built a dedicated testing environment called KernelGYM to make training more reliable and to catch reward hacking. They also developed a new training method, Turn-level Reinforce-Leave-One-Out (TRLOO), which gives the model an unbiased estimate of how much better or worse each attempt was than its alternatives, so it can learn from multi-turn feedback without skew. To combat lazy optimization, they added techniques that explicitly reward measured speed improvements and discard kernels that show no speedup. The resulting model, Dr.Kernel-14B, performs competitively with some of the best existing AI models at generating kernels.
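The core of TRLOO is a leave-one-out baseline: each sampled attempt's reward is compared against the average reward of the other attempts in its group, so a sample never contributes to its own baseline (the self-inclusion issue the authors identify in GRPO). Below is a minimal sketch of that advantage computation; the function name and structure are illustrative assumptions and do not reproduce the paper's turn-level bookkeeping.

```python
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """For each sampled rollout in a group, subtract the mean reward of the
    *other* rollouts, so a sample's own reward never leaks into its baseline
    (unlike a plain group-mean baseline). Illustrative sketch only."""
    k = len(rewards)
    if k < 2:
        return [0.0] * k
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: four rollouts sampled for the same kernel-writing turn.
print(leave_one_out_advantages([1.0, 0.0, 0.5, 0.0]))
```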
Why it matters?
This work is important because it shows that AI can be successfully trained to write high-performance code, potentially leading to significant speedups in AI applications. On the KernelBench Level-2 subset, Dr.Kernel-14B even outperforms models like Claude-4.5-Sonnet and GPT-5 at producing kernels with meaningful speedups, suggesting AI is getting closer to being able to optimize itself and accelerate the field of artificial intelligence.
Abstract
High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. We further incorporate mismatch correction for training stability, and to alleviate lazy optimization we introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
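For intuition on the profiling-based components, here is a hedged sketch assuming the reward is gated by correctness and scaled by measured speedup over the Torch reference, with rejection sampling dropping merely-correct but slow candidates. The function names, cap, and threshold are assumptions for illustration, not the paper's exact design.

```python
def profiling_reward(correct: bool, ref_ms: float, kernel_ms: float) -> float:
    """Correctness gates the reward; measured speedup over the Torch
    reference scales it (illustrative shaping, capped to stay bounded)."""
    if not correct or kernel_ms <= 0:
        return 0.0
    speedup = ref_ms / kernel_ms
    return min(speedup, 5.0)  # cap is an assumption, not from the paper

def keep_for_training(correct: bool, ref_ms: float, kernel_ms: float,
                      min_speedup: float = 1.0) -> bool:
    """Rejection-sampling filter: keep only kernels that are correct and
    at least match the reference runtime (threshold is an assumption)."""
    return correct and kernel_ms > 0 and (ref_ms / kernel_ms) >= min_speedup
```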